Linear Algebra for Data Science 


Wanmo Kang
KAIST
wanmo.kang@kaist.ac.kr

Kyunghyun Cho
New York University
Genentech
kyunghyun.cho@nyu.edu


Version 1.0 
July 12, 2024 


Contents

Preface

1 Introduction

2 Matrices and Gaussian Elimination
    2.1 Matrix Operations
    2.2 Solving Simultaneous Linear Equations
    2.3 An Example of Gaussian Elimination
    2.4 Block Matrices
    2.5 Inverse of a Matrix
    2.6 Triangular Factors and LU-Decomposition
    2.7 Inverse of a Block Matrix
    2.8 Application to Data Science: Graphs and Matrices

3 Vector Spaces
    3.1 Vector Spaces and Subspaces
        3.1.1 Operations in a Vector Space
        3.1.2 Two Fundamental Subspaces induced by Matrices
    3.2 Solving Ax = 0 and Ax = b
        3.2.1 A Row Echelon Form U
        3.2.2 Pivot Variables and Free Variables
    3.3 Linear Independence, Basis, and Dimension
    3.4 Rank of a Matrix
    3.5 The Four Fundamental Subspaces
    3.6 Existence of an Inverse
    3.7 Rank-one Matrices
    3.8 Linear Transformation
        3.8.1 A Matrix Representation of Linear Transformation
        3.8.2 Interpretable Linear Transformations

    3.9 Application to Graph Theory
    3.10 Application to Data Science: Neural Networks
        3.10.1 Flexibility of Neural Network Representations

4 Orthogonality and Projections
    4.1 Inner Products
    4.2 Orthogonal Vectors and Subspaces
    4.3 Orthogonal Projection
        4.3.1 Projection onto the Direction of a Vector
        4.3.2 Projection onto a Subspace Spanned by Orthonormal Vectors
        4.3.3 Projection onto a Subspace Spanned by Independent Vectors
    4.4 Building an Orthonormal Basis: Gram-Schmidt Procedure
        4.4.1 Gram-Schmidt Procedure for given Linearly Independent Vectors
        4.4.2 Projection as Distance Minimization
    4.5 Decomposition into Orthogonal Complements
    4.6 Orthogonality in Euclidean Spaces R^n
        4.6.1 Orthogonal Complements of the Fundamental Subspaces
    4.7 Orthogonal Matrices
        4.7.1 QR Decomposition
        4.7.2 Isometry induced by an Orthogonal Matrix
    4.8 Matrix Norms
    4.9 Application to Data Science: Least Square and Projection
        4.9.1 Least Square as a Convex Quadratic Minimization
        4.9.2 Equivalence between Least Square and Projection

5 Singular Value Decomposition (SVD)
    5.1 A Variational Formulation for the Best-fit Subspaces
        5.1.1 Best-fit 1-dimensional subspace
        5.1.2 Best-fit 2-dimensional subspace
        5.1.3 Best-fit k-dimensional subspace
    5.2 Orthogonality of Left Singular Vectors
    5.3 Representing SVD in Various Forms
    5.4 Properties of a Sum of Rank-one Matrices
    5.5 Spectral Decomposition of a Symmetric Matrix via SVD
    5.6 Relationship between Singular Values and Eigenvalues
    5.7 Low rank approximation and Eckart-Young-Mirsky Theorem
    5.8 Pseudoinverses
        5.8.1 Generalized Projection and Least Squares


    5.9 How to Obtain SVD Meaningfully
    5.10 Application to Statistics: Principal Components Analysis (PCA) from SVD

6 SVD in Practice
    6.1 Single-Image Compression
        6.1.1 Singular values reveal the amount of information in low rank approximation
    6.2 Visualizing High-Dimensional Data via SVD
        6.2.1 Left-singular Vectors as the Coordinates of Embedding Vectors in the Latent Space
        6.2.2 Geometry of MNIST Images According to SVD
        6.2.3 Geometry of MNIST Images in the Latent Space of a Variational Auto-Encoder
    6.3 Approximation of Financial Time-Series via SVD

7 Positive Definite Matrices
    7.1 Positive (Semi-)Definite Matrices
    7.2 Cholesky Factorization of a Positive Definite Matrix
    7.3 The Square Root of a Positive Semi-definite Matrix
    7.4 Variational Characterization of Symmetric Eigenvalues
        7.4.1 Eigenvalues and Singular Values of a Matrix Sum
    7.5 Ellipsoidal Geometry of Positive Definite Matrices
    7.6 Application to Data Science: a Kernel Trick in Machine Learning

8 Determinants
    8.1 Definition and Properties
    8.2 Formulas for the Determinant
        8.2.1 Determinant of a Block Matrix
        8.2.2 Matrix Determinant Lemma
    8.3 Applications of Determinant
        8.3.1 The Volume of a parallelopiped in R^n
        8.3.2 Computing A^{-1}
        8.3.3 Cramer's Rule: Solution of Ax = b
    8.4 Sherman-Morrison and Woodbury Formulas
    8.5 An Application to Optimization: Rank-One Update of Inverse Hessian

9 Further Results on Eigenvalues and Eigenvectors
    9.1 Examples of Eigendecomposition
    9.2 Properties of an Eigenpair
    9.3 Similarity and the Change of Basis
        9.3.1 The Change of Basis
        9.3.2 The Change of Orthogonal Basis


        9.3.3 Similarity
    9.4 Diagonalization
    9.5 The Spectral Decomposition Theorem
    9.6 How to Compute Eigenvalues and Eigenvectors
    9.7 Application to Data Science: Power Iteration and Google PageRank

10 Advanced Results in Linear Algebra
    10.1 A Dual Space
    10.2 Transpose of Matrices and Adjoint of Linear Transformations
        10.2.1 Adjoint and Projection
    10.3 Further Results on Positive Definite Matrices
        10.3.1 Congruence Transformations
        10.3.2 A Positive Semi-definite Cone and Partial Order
    10.4 Schur Triangularization
    10.5 Perron-Frobenius Theorem
    10.6 Eigenvalue Adjustments and Applications

11 Big Theorems in Linear Algebra
    11.1 The First Big Theorem: Cayley-Hamilton Theorem
    11.2 Decomposition of Nilpotency into Cyclic Subspaces
    11.3 Nilpotency of A − λI
    11.4 The Second Big Theorem: the Jordan Normal Form Theorem

12 Homework Assignments

13 Problems
    13.1 Problems for Chapter 1 ~ 4
    13.2 Problems for Chapter 5 ~ 9

Bibliography

Appendices
A Convexity
B Permutation and its Matrix Representation
C The Existence of Optimizers


D Covariance Matrices
    D.1 Covariance Matrix of Random Vector
        D.1.1 Positive Definiteness of Covariance Matrices
        D.1.2 A Useful Quadratic Identity
    D.2 Multivariate Gaussian Distribution
    D.3 Conditional Multivariate Gaussian Distribution
    D.4 Multivariate Gaussian Sampling using Cholesky Decomposition
    D.5 Ill-conditioned Sample Covariance Matrices
E Complex Numbers and Matrices
F An Alternative Proof of the Spectral Decomposition Theorem

Index

Chapter 1 


Introduction 


In mathematics, most areas deal with various types of spaces and study functions defined between these
spaces, as well as how these functions preserve or fail to preserve various properties of these spaces. Linear
algebra in particular deals with spaces in which objects can be scaled by a scalar and added together,
and studies functions that preserve these properties. We refer to these objects, spaces and functions
as vectors, vector spaces and linear transformations, respectively. We use linearity to collectively refer
to the scalability and additivity of vectors that are preserved under linear transformations between vector
spaces. This linearity enables us to understand the universal structure of vector spaces and linear
transformations. For example, vector spaces of the same dimension are roughly equivalent. It also
turns out that a linear transformation can be characterized by a rectangular array of numbers called a
matrix. To investigate and classify linear transformations, we therefore work with the matrices corresponding
to them, which allows us to classify linear transformations into Jordan forms. Such fundamental efforts at
classification have led to a variety of by-products that have proven useful in many real-world applications.
In this book, we do not compromise between the fundamental results and the useful by-products: we provide
readers with gap-free derivations of the useful by-products from the fundamental results.

Let us take a closer look. The two most important objects in describing and studying linear algebra
are vectors and matrices. We may refer to any mathematical objects as vectors as long as vector addition
and scalar multiplication are well defined for them. For vector addition, we denote the identity, which
is usually called the zero vector, by $\mathbf{0}$. We also place a minus sign ($-$) in front of a vector to
denote its additive inverse. A vector space is defined as a collection of vectors that satisfy the
distributive laws between vector addition and scalar multiplication. For example, the distributive
laws, together with the notational convention $1\mathbf{v} = \mathbf{v}$, give, for any vector $\mathbf{v}$ in a vector space,
\[ \mathbf{v} + \mathbf{v} = 1\mathbf{v} + 1\mathbf{v} = (1 + 1)\mathbf{v} = 2\mathbf{v}. \]
We will dive deeper into vector spaces in Chapter 3.


Kang and Cho, Linear Algebra for Data Science, 1 
©2024. (Wanmo Kang, Kyunghyun Cho) all rights reserved. 




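One of the most intuitive yet important examples of a vector is a finite array of numbers. We can
vertically stack $m$ real values $v_1, v_2, \ldots, v_m$ to form an $m$-dimensional vector $\mathbf{v}$:
\[ \mathbf{v} = \begin{bmatrix} v_1 \\ v_2 \\ \vdots \\ v_m \end{bmatrix}, \]
where we use $v_i$ to refer to the $i$-th component of $\mathbf{v}$. We express that a vector $\mathbf{v}$ is an $m$-dimensional
vector by $\mathbf{v} \in \mathbb{R}^m$ and often refer to it as either an $m$-vector or an $\mathbb{R}^m$-vector. For $\mathbb{R}^m$-vectors, we define
vector addition and scalar multiplication by
\[ \mathbf{v} + \mathbf{w} = \begin{bmatrix} v_1 \\ v_2 \\ \vdots \\ v_m \end{bmatrix} + \begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_m \end{bmatrix} = \begin{bmatrix} v_1 + w_1 \\ v_2 + w_2 \\ \vdots \\ v_m + w_m \end{bmatrix}, \qquad c\mathbf{v} = \begin{bmatrix} c v_1 \\ c v_2 \\ \vdots \\ c v_m \end{bmatrix}. \]
For vector addition, the zero vector $\mathbf{0} \in \mathbb{R}^m$, whose entries are all 0, serves as the additive identity.
When the dimensionality matters, we use a subscript to emphasize it, as in $\mathbf{0}_m$ for the $m$-dimensional
zero vector.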

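In this book, we define a matrix as a rectangular array of numbers obtained by horizontally concatenating
$\mathbb{R}^m$-vectors. Given $n$ $\mathbb{R}^m$-vectors
\[ \mathbf{a}_1 = \begin{bmatrix} a_{11} \\ a_{21} \\ \vdots \\ a_{m1} \end{bmatrix}, \quad \mathbf{a}_2 = \begin{bmatrix} a_{12} \\ a_{22} \\ \vdots \\ a_{m2} \end{bmatrix}, \quad \ldots, \quad \mathbf{a}_n = \begin{bmatrix} a_{1n} \\ a_{2n} \\ \vdots \\ a_{mn} \end{bmatrix}, \]
we horizontally concatenate them to obtain a matrix $A$:
\[ A = [\mathbf{a}_1 \,|\, \mathbf{a}_2 \,|\, \cdots \,|\, \mathbf{a}_n] = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{bmatrix}. \]
We call the rows and columns of a matrix row vectors and column vectors, respectively. When we are
given the name of a matrix, such as $A$ in this case, we use $(A)_{ij}$ or $a_{ij}$ to denote the element in the $i$-th
row and $j$-th column. We say that $A$ is an $m \times n$ matrix if $A$ has $m$ rows and $n$ columns. We add
two matrices, $A$ and $B$, by adding each pair of corresponding components from these two matrices. That is,
\[ \begin{bmatrix} a_{11} & \cdots & a_{1n} \\ a_{21} & \cdots & a_{2n} \\ \vdots & \ddots & \vdots \\ a_{m1} & \cdots & a_{mn} \end{bmatrix} + \begin{bmatrix} b_{11} & \cdots & b_{1n} \\ b_{21} & \cdots & b_{2n} \\ \vdots & \ddots & \vdots \\ b_{m1} & \cdots & b_{mn} \end{bmatrix} = \begin{bmatrix} a_{11} + b_{11} & \cdots & a_{1n} + b_{1n} \\ a_{21} + b_{21} & \cdots & a_{2n} + b_{2n} \\ \vdots & \ddots & \vdots \\ a_{m1} + b_{m1} & \cdots & a_{mn} + b_{mn} \end{bmatrix}. \]
We define scalar multiplication by multiplying each component by a scalar:
\[ c \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{bmatrix} = \begin{bmatrix} c a_{11} & c a_{12} & \cdots & c a_{1n} \\ c a_{21} & c a_{22} & \cdots & c a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ c a_{m1} & c a_{m2} & \cdots & c a_{mn} \end{bmatrix}. \]
We can regard any $\mathbb{R}^n$-vector as a matrix of $n$ rows and 1 column.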

We often add various structures and operations to a vector space to use it in practice. For instance,
it is natural to add matrix multiplication to a vector space of matrices. We will delve deeper into matrix
multiplication in the next chapter; here, we consider the simple case of multiplying an $m \times n$ matrix
by an $n \times 1$ matrix, which is equivalent to an $\mathbb{R}^n$-vector. Such multiplication is defined as
\[ A\mathbf{v} = v_1 \mathbf{a}_1 + \cdots + v_n \mathbf{a}_n = \sum_{j=1}^{n} v_j \mathbf{a}_j. \tag{1.1} \]
This results in an $\mathbb{R}^m$-vector, i.e., $A\mathbf{v} \in \mathbb{R}^m$. This vector is a linear combination of the column vectors of
the matrix $A$, and the $v_j$'s work as the weights/coefficients of this combination.
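
To make (1.1) concrete, the following minimal Python sketch computes $A\mathbf{v}$ by accumulating the columns of $A$ weighted by the entries of $\mathbf{v}$. The helper names `column` and `matvec_by_columns`, the list-of-rows storage and the numerical values are our own illustrative assumptions, not part of the text.

```python
# Sketch of (1.1): Av as a weighted sum of the columns of A.
# A is stored as a list of rows; all values are illustrative.

def column(A, j):
    """Return the j-th column of A as a list."""
    return [row[j] for row in A]

def matvec_by_columns(A, v):
    """Compute Av = v_1 a_1 + ... + v_n a_n, one column at a time."""
    m, n = len(A), len(A[0])
    result = [0] * m
    for j in range(n):
        a_j = column(A, j)
        # Add the j-th column scaled by the j-th weight.
        result = [r + v[j] * a for r, a in zip(result, a_j)]
    return result

A = [[1, 2, 3],
     [4, 5, 6]]   # a 2 x 3 matrix (m = 2, n = 3)
v = [1, 0, -1]    # a vector in R^3

Av = matvec_by_columns(A, v)
print(Av)         # a vector in R^2
```

Here the first column enters with weight 1, the second with weight 0 and the third with weight −1, so the result is the first column minus the third.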

It is useful to use matrices and $\mathbb{R}^m$-vectors to model data. For instance, we can represent a group
of $n$ people by horizontally stacking the $\mathbb{R}^m$-vectors describing their characteristics to form an $m \times n$
matrix $A$. In this matrix, $(A)_{ij}$ represents the $j$-th person's $i$-th property. We can compute the average
of each of these properties as a matrix-vector multiplication:
\[ \frac{1}{n} A \mathbf{1}, \quad \text{where } \mathbf{1} = \begin{bmatrix} 1 \\ \vdots \\ 1 \end{bmatrix} \in \mathbb{R}^n. \]
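
As a small illustration of this idea, assume (as in the horizontal stacking described above) that each column of $A$ holds one person and each row holds one property. Under that assumption the per-property averages are $\frac{1}{n} A \mathbf{1}$ with an all-ones vector $\mathbf{1} \in \mathbb{R}^n$. The function name `average_properties` and the data are hypothetical.

```python
# Sketch: per-property averages as the matrix-vector product (1/n) * A * 1.
# Assumption: columns of A are people, rows of A are properties.

def average_properties(A):
    """Return (1/n) * A * ones for an m x n matrix A given as a list of rows."""
    n = len(A[0])
    ones = [1] * n
    # Each entry is the inner product of one row of A with the ones vector,
    # divided by the number of people n.
    return [sum(a * o for a, o in zip(row, ones)) / n for row in A]

# Hypothetical data: two properties (height in cm, age in years), three people.
A = [[170, 160, 180],
     [30, 40, 50]]

averages = average_properties(A)
print(averages)
```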
As another example, consider the following model of economy. 


• $a_{ij}$: the contribution of material $j$ to the $i$-th product; $\mathbf{a}_j$: an $\mathbb{R}^m$-vector whose $i$-th entry is $a_{ij}$;
• $x_j$: the amount of material $j$ available; $\mathbf{x}$: an $\mathbb{R}^n$-vector whose $j$-th entry is $x_j$.

Under this model, we can interpret the linear combination of the $\mathbf{a}_j$'s with weights $x_j$, i.e.,
$x_1 \mathbf{a}_1 + \cdots + x_n \mathbf{a}_n = \sum_{j=1}^{n} x_j \mathbf{a}_j$, as the amounts of the $m$ manufactured products produced from
the given amounts of the $n$ raw materials. We can write this more concisely as $A\mathbf{x}$, with $(A)_{ij} = a_{ij}$.
Given a target production vector $\mathbf{b}$, we can now write the problem of finding the amounts of raw materials
needed to meet the target production quantities as solving
\[ A\mathbf{x} = \mathbf{b}. \]


These examples illustrate how simple it is to use matrices and vectors to reason about data.
We can naturally and seamlessly introduce and derive various concepts and results in linear algebra
by solving $A\mathbf{x} = \mathbf{b}$ above. More specifically, we will learn the following concepts and results in this book:

• How to solve linear systems: Gaussian elimination;
• Abstraction and manipulation of data: a vector space and linear transformation;
• Approximation of data: orthogonality, projection and least squares;
• Factorization of data: SVD (singular value decomposition), PCA (principal components analysis)
  and pseudoinverse;
• Shapes of data: covariance, positive definiteness and convexity;
• Key features of matrices: determinant, eigenvalue and eigenvector;
• Advanced results for matrices: adjoint, positive definite cone and Perron-Frobenius theorem;
• Theorems by Cayley-Hamilton and Jordan.


In data science, it is common to analyze complex data by projecting their high-dimensional vector
representations onto a lower-dimensional subspace and investigating the corresponding lower-dimensional vectors.
SVD is one of the most representative approaches to determining the best subspace for approximating
high-dimensional data. Additionally, positive definite matrices and their properties are frequently used
to characterize relationships within data, both in data science and in engineering. In this book, we make
a significant departure from existing textbooks and lecture notes in linear algebra and go directly to the
concepts of projection, SVD and positive definiteness without first introducing eigenvalues or eigenvectors in
detail.


Chapter 2 


Matrices and Gaussian Elimination 


We say that we solve multiple linear equations when we determine the values of the unknown variables that
satisfy all of these equations simultaneously. We can do so by progressively eliminating unknown
variables from the equations and then reading off the values of the variables. This process, to which
we refer as Gaussian elimination, progressively modifies the linear equations without altering the solution of
the original equations. By using matrices, we can describe this process of successive elimination
of variables without referring to variables, signs or equalities. In other words,
Gaussian elimination can be described as a sequence of matrix operations. In addition to addition
and scalar multiplication, we define matrix-matrix multiplication in order to represent any modification of
linear equations by Gaussian elimination as the multiplication of the matrix representing the linear equations
by a specially structured matrix. In doing so, we obtain a surprisingly rich set of concepts and mathematical
results on matrices.


More specifically, we introduce matrix-matrix multiplication in this chapter. When we multiply a
vector by a matrix from the left, we get a linear combination of the columns of the matrix. When we
multiply two matrices, the resulting matrix consists of the columns obtained by multiplying the first
matrix by each column of the second matrix. The inverse with respect to matrix-matrix multiplication
is called an inverse matrix, and not every matrix has an inverse. We define the transpose of a matrix
by swapping its row and column indices and introduce a symmetric matrix as a matrix whose transpose
is itself. Symmetric matrices show up in many places throughout this chapter and the rest of the book,
as they exhibit mathematically favorable properties. We define lower and upper triangular matrices as
matrices whose elements above and below the diagonal are zeros, respectively. With these various types
of matrices, we show that Gaussian elimination corresponds to multiplying the matrix of linear equations
from the left by a series of lower triangular matrices to arrive at an upper triangular matrix. We call
this process LU factorization, and it connects to the process of inverting the matrix of the linear equations.
We eventually describe this whole process in terms of block matrices.




2.1 Matrix Operations 


We can extend matrix-vector multiplication (1.1) to matrix-matrix multiplication. Instead of $\mathbf{v} \in \mathbb{R}^n$,
consider an $n \times \ell$ matrix $B = [\mathbf{b}_1 \,|\, \mathbf{b}_2 \,|\, \cdots \,|\, \mathbf{b}_\ell] = (b_{jk})$. Matrix-vector multiplication between $A$ and the
$k$-th column of $B$, $\mathbf{b}_k$, is then
\[ A\mathbf{b}_k = b_{1k}\mathbf{a}_1 + b_{2k}\mathbf{a}_2 + \cdots + b_{nk}\mathbf{a}_n. \]
We define matrix-matrix multiplication of $A$ and $B$ by horizontally stacking the resulting vectors:
\[ AB = [A\mathbf{b}_1 \,|\, A\mathbf{b}_2 \,|\, \cdots \,|\, A\mathbf{b}_\ell]. \]
Matrix-matrix multiplication is well defined only when the number of columns of the first matrix and
the number of rows of the second matrix coincide. In other words, multiplying
an $m \times n$ matrix by an $n \times \ell$ matrix results in an $m \times \ell$ matrix. We can compute $(AB)_{ij}$ in multiple
ways:
\[ (AB)_{ij} = b_{1j}a_{i1} + \cdots + b_{nj}a_{in} = \sum_{k=1}^{n} a_{ik}b_{kj} = \begin{bmatrix} a_{i1} & \cdots & a_{in} \end{bmatrix} \begin{bmatrix} b_{1j} \\ \vdots \\ b_{nj} \end{bmatrix}. \]
Matrix-matrix multiplication is associative, i.e., $(AB)C = A(BC)$. Matrix-matrix addition and
multiplication satisfy distributivity, i.e., $A(B + C) = AB + AC$ and $(B + C)D = BD + CD$. Unlike
the product of real numbers, however, matrix-matrix multiplication is not commutative.
It is easy to find two matrices, $E$ and $F$, such that $EF \neq FE$.
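
The two views of the product, column by column and entry by entry, can be sketched in a few lines of Python. The function names `matmul` and `matmul_by_columns` and the matrices `E` and `F` are illustrative choices of ours. The last lines give one concrete pair for which the two orders of multiplication differ.

```python
# Sketch: two equivalent ways to form AB, and a pair with EF != FE.

def matmul(A, B):
    """Entry formula: (AB)_ij = sum_k a_ik * b_kj."""
    n = len(B)
    assert all(len(row) == n for row in A), "columns of A must match rows of B"
    return [[sum(A[i][k] * B[k][j] for k in range(n))
             for j in range(len(B[0]))]
            for i in range(len(A))]

def matmul_by_columns(A, B):
    """Column view: the k-th column of AB is A times the k-th column of B."""
    cols = []
    for k in range(len(B[0])):
        b_k = [row[k] for row in B]
        cols.append([sum(a * b for a, b in zip(row, b_k)) for row in A])
    # cols holds the columns of AB; transpose back to a list of rows.
    return [list(row) for row in zip(*cols)]

E = [[1, 1],
     [0, 1]]
F = [[1, 0],
     [1, 1]]

EF = matmul(E, F)
FE = matmul(F, E)
print(EF, FE, EF == FE)
```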

The identity for matrix addition is a matrix of all zeros, and we use $0$ to denote it. Although it is
often unnecessary to specify the size of such an all-zero matrix, when necessary we use a subscript, i.e.,
$0_{mn}$ for an $m \times n$ matrix of all zeros. The identity for matrix multiplication is a square matrix, that is,
a matrix with the same number of rows and columns, whose diagonal entries $a_{ii}$ are 1 and whose off-diagonal
entries are all zeros. We call this an identity matrix and use the following notation:
\[ I = \begin{bmatrix} 1 & & 0 \\ & \ddots & \\ 0 & & 1 \end{bmatrix}. \]
When necessary, we use a subscript to indicate the size of the identity matrix, as in
\[ A I_n = A = I_m A, \]
where $A$ is an $m \times n$ matrix. Another helpful special matrix is a diagonal matrix, whose off-diagonal
entries are all zeros:
\[ D = \begin{bmatrix} d_1 & & 0 \\ & \ddots & \\ 0 & & d_n \end{bmatrix} = \operatorname{diag}(d_1, \ldots, d_n). \]
Diagonal entries may also be zero. When a matrix is multiplied by a diagonal matrix from the left, its rows
are scaled by the corresponding diagonal entries. When multiplied from the right, its columns are scaled
accordingly.
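
The row and column scaling can be checked on a small case. In the sketch below, the helper `diag` and the numbers are our own illustrative assumptions.

```python
# Sketch: left multiplication by a diagonal matrix scales rows,
# right multiplication scales columns.

def diag(entries):
    """Build a square matrix with the given diagonal and zeros elsewhere."""
    n = len(entries)
    return [[entries[i] if i == j else 0 for j in range(n)] for i in range(n)]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))]
            for i in range(len(A))]

A = [[1, 2],
     [3, 4]]
D = diag([10, 100])

DA = matmul(D, A)   # row 1 scaled by 10, row 2 scaled by 100
AD = matmul(A, D)   # column 1 scaled by 10, column 2 scaled by 100
print(DA, AD)
```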


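It is sometimes useful to transpose a matrix. The transpose $A^\top$ of an $m \times n$ matrix $A$ is the
$n \times m$ matrix defined by
\[ (A^\top)_{ij} = (A)_{ji}. \]
A simple example is
\[ \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{bmatrix}^\top = \begin{bmatrix} 1 & 4 \\ 2 & 5 \\ 3 & 6 \end{bmatrix}. \]
There are two useful properties of the transpose in conjunction with matrix addition and multiplication:

• $(A + B)^\top = A^\top + B^\top$,
  since $((A + B)^\top)_{ij} = (A + B)_{ji} = (A)_{ji} + (B)_{ji} = (A^\top)_{ij} + (B^\top)_{ij} = (A^\top + B^\top)_{ij}$;

• $(AB)^\top = B^\top A^\top$,
  since $((AB)^\top)_{ij} = (AB)_{ji} = \sum_{k} (A)_{jk}(B)_{ki} = \sum_{k} (B^\top)_{ik}(A^\top)_{kj} = (B^\top A^\top)_{ij}$.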
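
Both transpose rules can be verified numerically on small matrices. The sketch below uses hypothetical helpers (`transpose`, `add`, `matmul`) and arbitrary integer matrices; it is a sanity check on one example, not a proof.

```python
# Sketch: check (A + C)^T = A^T + C^T and (AB)^T = B^T A^T on one example.

def transpose(A):
    """Swap row and column indices: (A^T)_ij = (A)_ji."""
    return [list(col) for col in zip(*A)]

def add(A, B):
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))]
            for i in range(len(A))]

A = [[1, 2, 3], [4, 5, 6]]      # 2 x 3
C = [[0, 1, 0], [1, 0, 1]]      # 2 x 3, same size as A so that A + C exists
B = [[1, 0], [2, 1], [0, 3]]    # 3 x 2, so that AB exists

sum_rule_holds = transpose(add(A, C)) == add(transpose(A), transpose(C))
product_rule_holds = transpose(matmul(A, B)) == matmul(transpose(B), transpose(A))
print(sum_rule_holds, product_rule_holds)
```

Note the reversed order in the product rule: $A^\top B^\top$ would be a $3 \times 3$ matrix here, while $(AB)^\top$ is $2 \times 2$.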


It is natural to extend the matrix transpose to a vector transpose. Since an $\mathbb{R}^n$-vector $\mathbf{a}$ can be thought of
as an $n \times 1$ matrix, $\mathbf{a}^\top$ is correspondingly thought of as a $1 \times n$ matrix. If we use this in the context of
matrix-vector multiplication (1.1) with a $1 \times n$ matrix, i.e., $A = \mathbf{a}^\top$, then $A\mathbf{v} = \mathbf{a}^\top\mathbf{v}$ results in a $1 \times 1$ matrix,
which is a real-valued scalar. In such a case, we do not write it as a matrix but simply as a scalar:
\[ \mathbf{a}^\top \mathbf{v} = \sum_{j=1}^{n} a_j v_j. \tag{2.1} \]
We call this the (standard) inner product of $\mathbf{a}$ and $\mathbf{v}$ and will discuss it in more detail later, when we introduce
the notion of inner products in a vector space (Definition 4.3). With this definition of the inner product,
we can view matrix-vector multiplication as repeatedly computing the inner product between each row
vector of the matrix and the vector.
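
The row view of matrix-vector multiplication can be sketched as follows. The names `inner` and `matvec_by_rows` and the data are illustrative assumptions.

```python
# Sketch: Av computed as one inner product (2.1) per row of A.

def inner(a, v):
    """Standard inner product a^T v = sum_j a_j v_j."""
    return sum(a_j * v_j for a_j, v_j in zip(a, v))

def matvec_by_rows(A, v):
    """The i-th entry of Av is the inner product of the i-th row of A with v."""
    return [inner(row, v) for row in A]

A = [[1, 2, 3],
     [4, 5, 6]]
v = [1, 0, -1]

Av = matvec_by_rows(A, v)
print(Av)
```

This produces the same vector as the column view in (1.1): the two views only differ in the order in which the same products $a_{ij} v_j$ are summed.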

Now that we know what the transpose of a matrix is, we can think of a matrix that is invariant under
transposition. We call such a matrix a symmetric matrix.

Definition 2.2 $A$ is symmetric if $A^\top = A$.

Symmetric matrices possess many desirable properties and have been an important object of
investigation in linear algebra. Some simple properties of symmetric matrices include:

• Any symmetric matrix is square.
• Every diagonal matrix is symmetric.
• For any matrix $A$, both $A^\top A$ and $AA^\top$ are symmetric.
• $ADA^\top$ and $A^\top DA$ are symmetric when $D$ is a diagonal matrix.

We encourage you to think about why these properties hold. We will introduce a richer set of properties
of symmetric matrices throughout the rest of the book.
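
As a hint for the last two properties, one possible line of reasoning is sketched below. It assumes the identity $(A^\top)^\top = A$, the product rule for the transpose stated earlier (applied twice for a product of three matrices), and $D^\top = D$ for a diagonal matrix $D$ of compatible size.

\[ (A^\top A)^\top = A^\top (A^\top)^\top = A^\top A, \qquad (ADA^\top)^\top = (A^\top)^\top D^\top A^\top = ADA^\top. \]

The cases $AA^\top$ and $A^\top DA$ follow in the same way with the roles of $A$ and $A^\top$ exchanged.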


2.2 Solving Simultaneous Linear Equations 


There is a close relationship between solving simultaneous linear equations and manipulating matrices.
Consider the following system of two linear equations. We can represent the same system using
matrix-vector multiplication as well as a single matrix:
\[ \begin{aligned} \text{(equation 1)} \quad 1x + 1y &= 5 \\ \text{(equation 2)} \quad 2x - 1y &= 1 \end{aligned} \iff \begin{bmatrix} 1 & 1 \\ 2 & -1 \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} 5 \\ 1 \end{bmatrix} \iff \begin{bmatrix} 1 & 1 & 5 \\ 2 & -1 & 1 \end{bmatrix}. \]

How do we solve these linear equations? First, we multiply both sides of the first equation by two
and subtract the result from the second equation. This process, to which we often refer as Gaussian elimination,
is equivalent to multiplying the $2 \times 3$ matrix on the right-hand side above from the left by a special matrix,
called an elementary matrix. In this particular example, the elementary matrix is
$\begin{bmatrix} 1 & 0 \\ -2 & 1 \end{bmatrix}$ and represents the process of eliminating the first variable $x$ from the second equation:
\[ \text{(equation 2)} - 2 \times \text{(equation 1)} \to \text{(equation 2)}: \quad \begin{aligned} 1x + 1y &= 5 \\ -3y &= -9 \end{aligned} \iff \begin{bmatrix} 1 & 0 \\ -2 & 1 \end{bmatrix} \begin{bmatrix} 1 & 1 & 5 \\ 2 & -1 & 1 \end{bmatrix} = \begin{bmatrix} 1 & 1 & 5 \\ 0 & -3 & -9 \end{bmatrix}. \]

After this step, we determine the value of the second variable $y$ by
\[ -3y = -9 \implies y = 3. \]

After substituting $y$ with 3 in the first equation, we can determine the value of the first variable $x$ as
\[ x + y = 5, \; y = 3 \implies x = 5 - 3 = 2. \]
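
The elimination step and the two substitutions above can be replayed in code. This is a minimal sketch with exact rational arithmetic from the standard library; the variable names (`augmented`, `reduced`) are our own.

```python
# Sketch: one elimination step as a left multiplication by an elementary
# matrix, followed by reading off y and then x.
from fractions import Fraction

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))]
            for i in range(len(A))]

augmented = [[1, 1, 5],     # 1x + 1y = 5
             [2, -1, 1]]    # 2x - 1y = 1
E = [[1, 0],
     [-2, 1]]               # (equation 2) - 2 x (equation 1)

reduced = matmul(E, augmented)

# The second row now reads (coefficient of y) * y = right-hand side.
y = Fraction(reduced[1][2], reduced[1][1])
# Substitute y into the first row and solve for x.
x = (reduced[0][2] - reduced[0][1] * y) / reduced[0][0]
print(reduced, x, y)
```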


The Geometry of Linear Equations 


Unlike when there are three or more variables, it is possible for us to visualize the geometry behind
linear equations when there are only two variables. More specifically, we can interpret the geometry of
linear equations from two perspectives.

Row-wise interpretation. We can plot the solution curve of each equation (i.e., the curve on which
the equation is satisfied). A point where these two curves meet corresponds to the variable values that
satisfy both equations. In the matrix-vector notation, this corresponds to comparing the product
of each row vector of the coefficient matrix and the variable vector against the corresponding element of
the vector on the right-hand side:
\[ \begin{cases} x + y = 5 \\ 2x - y = 1 \end{cases} \iff \begin{bmatrix} 1 & 1 \\ 2 & -1 \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} 5 \\ 1 \end{bmatrix}. \]

[Figure: the lines $x + y = 5$ and $2x - y = 1$ in the $(x, y)$-plane; the solution is the intersection of the 2 lines.]


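Column-wise interpretation. According to (1.1), the left-hand side of the matrix-vector notation can
be thought of as computing the linear combination of the column vectors of the coefficient matrix $A$ with
the variables serving as linear coefficients. In this case, we can consider solving linear equations as finding
the linear coefficients that result in the right-hand-side vector:
\[ \begin{cases} x + y = 5 \\ 2x - y = 1 \end{cases} \iff x \begin{bmatrix} 1 \\ 2 \end{bmatrix} + y \begin{bmatrix} 1 \\ -1 \end{bmatrix} = \begin{bmatrix} 5 \\ 1 \end{bmatrix}. \]

[Figure: the right-hand-side vector as the linear combination $2\mathbf{v} + 3\mathbf{w}$ of the 2 column vectors $\mathbf{v} = (1, 2)^\top$ and $\mathbf{w} = (1, -1)^\top$ of the coefficient matrix.]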


With these in mind, let us extend this two-variable example to a system of $m$ equations with
$n$ variables. We now know that we can represent this system using an $m \times n$ coefficient matrix $A$, an
$n$-dimensional variable vector $\mathbf{x}$ and an $m$-dimensional vector $\mathbf{b}$ as $A\mathbf{x} = \mathbf{b}$. Just like in the two-variable
case above, we can interpret the solution $\mathbf{x}$ to the system as the intersection of $m$ hyperplanes in $\mathbb{R}^n$,
represented by the $m$ rows of $A$, or as the coefficients of a combination of the $n$ column vectors of $A$ in $\mathbb{R}^m$.
Depending on how those $m$ hyperplanes are arranged relative to each other, there may be one unique solution, no
solution or infinitely many solutions. From the column-wise interpretation, we need the concept
of linear independence of vectors, which we will define later in Definition 3.4, in order to determine the
existence and the number of solutions. In the example above, the column vectors $\mathbf{v}$ and $\mathbf{w}$ are linearly
independent, and therefore for any vector $\mathbf{b}$ on the right-hand side, there exists a unique solution. Later,
we will show more generally that a solution exists for any given $\mathbf{b}$ when the coefficient matrix $A$ has
$m$ linearly independent column vectors, and that it is unique when these are all of its columns (i.e., $n = m$).


2.3 An Example of Gaussian Elimination 


Let us consider the following system of three equations in three variables:
\[ \begin{aligned} 2u + v + w &= 5 \\ 4u - 6v &= -2 \\ -2u + 7v + 2w &= 9. \end{aligned} \]
We can write this system more concisely as a matrix:
\[ \begin{bmatrix} 2 & 1 & 1 & 5 \\ 4 & -6 & 0 & -2 \\ -2 & 7 & 2 & 9 \end{bmatrix}. \]
How would we solve these equations? We eliminate the first variable from the second equation and then
eliminate the first and second variables from the third equation. We then determine the third variable
from the third equation (because we have already eliminated the first two variables) and plug it into the
first and second equations, after which we can determine the rest of the variables. This whole process
of progressive elimination can be expressed as a series of multiplications from the left by so-called elementary
matrices, where an elementary matrix is defined as an identity matrix with exactly one off-diagonal entry
set to a non-zero number. For instance, if we multiply $A$ from the left by an elementary matrix $E$ that has
$(E)_{ij} = b$ with $i > j$, we end up with a matrix that satisfies:

• All rows of $EA$ are identical to those of $A$ except for the $i$-th row;
• The $i$-th row of $EA$ equals the sum of the $j$-th row of $A$ scaled by $b$ and the $i$-th row of $A$.


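1. (equation 2) $-$ 2 $\times$ (equation 1) $\to$ (equation 2) $\iff$ left multiplication by $\begin{bmatrix} 1 & 0 & 0 \\ -2 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}$:
\[ \begin{aligned} 2u + v + w &= 5 \\ -8v - 2w &= -12 \\ -2u + 7v + 2w &= 9 \end{aligned} \iff \begin{bmatrix} 1 & 0 & 0 \\ -2 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} 2 & 1 & 1 & 5 \\ 4 & -6 & 0 & -2 \\ -2 & 7 & 2 & 9 \end{bmatrix} = \begin{bmatrix} 2 & 1 & 1 & 5 \\ 0 & -8 & -2 & -12 \\ -2 & 7 & 2 & 9 \end{bmatrix} \]

2. (equation 3) $+$ (equation 1) $\to$ (equation 3) $\iff$ left multiplication by $\begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 1 & 0 & 1 \end{bmatrix}$:
\[ \begin{aligned} 2u + v + w &= 5 \\ -8v - 2w &= -12 \\ 8v + 3w &= 14 \end{aligned} \iff \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 1 & 0 & 1 \end{bmatrix} \begin{bmatrix} 2 & 1 & 1 & 5 \\ 0 & -8 & -2 & -12 \\ -2 & 7 & 2 & 9 \end{bmatrix} = \begin{bmatrix} 2 & 1 & 1 & 5 \\ 0 & -8 & -2 & -12 \\ 0 & 8 & 3 & 14 \end{bmatrix} \]

3. (equation 3) $+$ (equation 2) $\to$ (equation 3) $\iff$ left multiplication by $\begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 1 & 1 \end{bmatrix}$:
\[ \begin{aligned} 2u + v + w &= 5 \\ -8v - 2w &= -12 \\ w &= 2 \end{aligned} \iff \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 1 & 1 \end{bmatrix} \begin{bmatrix} 2 & 1 & 1 & 5 \\ 0 & -8 & -2 & -12 \\ 0 & 8 & 3 & 14 \end{bmatrix} = \begin{bmatrix} 2 & 1 & 1 & 5 \\ 0 & -8 & -2 & -12 \\ 0 & 0 & 1 & 2 \end{bmatrix} \]

More concisely, we can write the whole process above as successive multiplication by three elementary
matrices from the left (be mindful of the order of the elementary matrices):
\[ \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 1 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 1 & 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 & 0 \\ -2 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} 2 & 1 & 1 & 5 \\ 4 & -6 & 0 & -2 \\ -2 & 7 & 2 & 9 \end{bmatrix} = \begin{bmatrix} 2 & 1 & 1 & 5 \\ 0 & -8 & -2 & -12 \\ 0 & 0 & 1 & 2 \end{bmatrix}. \]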


We call this process Gaussian elimination. After Gaussian elimination, the resulting $i$-th equation should
have all the variables up to the $(i-1)$-th one eliminated. Equivalently, the resulting coefficient matrix
$C = (c_{ij})$ satisfies $c_{i1} = \cdots = c_{i(i-1)} = 0$. Such a matrix is called an upper triangular matrix, because
non-zero elements exist only in the upper triangular region of the matrix. We can similarly define a lower
triangular matrix.

Once we have an upper triangular coefficient matrix, we can readily determine the solution by
back-substitution. In the example above, we determine the values of the variables, from the last one
to the first one, by sweeping through the equations from bottom to top:
\[ \begin{aligned} &\text{Row 3: } w = 2 \\ \implies &\text{Row 2: } -8v = -12 + 2w = -8, \quad v = 1 \\ \implies &\text{Row 1: } 2u = 5 - v - w = 5 - 1 - 2 = 2, \quad u = 1. \end{aligned} \]
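
The three elimination steps and the back-substitution above can be reproduced with a short program. This is a minimal sketch using exact fractions; the helpers `elementary` and `back_substitute` are hypothetical names, and indices are 0-based in the code whereas the text counts rows from 1.

```python
# Sketch: Gaussian elimination as successive left multiplications by
# elementary matrices, followed by back-substitution.
from fractions import Fraction

def elementary(n, i, j, b):
    """Identity matrix of size n with entry (i, j) set to b (0-based indices)."""
    E = [[Fraction(int(r == c)) for c in range(n)] for r in range(n)]
    E[i][j] = Fraction(b)
    return E

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))]
            for i in range(len(A))]

def back_substitute(U):
    """Solve an upper triangular augmented system [C | d] from the bottom row up."""
    n = len(U)
    x = [Fraction(0)] * n
    for i in reversed(range(n)):
        known = sum(U[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (U[i][n] - known) / U[i][i]
    return x

augmented = [[2, 1, 1, 5],
             [4, -6, 0, -2],
             [-2, 7, 2, 9]]

E1 = elementary(3, 1, 0, -2)   # (equation 2) - 2 x (equation 1)
E2 = elementary(3, 2, 0, 1)    # (equation 3) + (equation 1)
E3 = elementary(3, 2, 1, 1)    # (equation 3) + (equation 2)

# The matrix applied first sits closest to the augmented matrix.
U = matmul(E3, matmul(E2, matmul(E1, augmented)))
u, v, w = back_substitute(U)
print(U, u, v, w)
```

The sketch assumes that every diagonal entry of the reduced coefficient matrix is non-zero; otherwise the division in `back_substitute` fails.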


Gaussian elimination may fail for one of the following reasons.

• Non-singular case (fixable by row exchange): We may end up eliminating too many variables in
  the second row and cannot perform elimination in the third row. In this case, we simply exchange
  the second and third rows. This works because the order of equations in a linear system does not
  change the problem.

• Singular case (not fixable): If a row is a linear combination of the other rows (for instance, a scalar
  multiple of another row), Gaussian elimination results in a row with all zeros. In this case, there may
  be either infinitely many solutions or no solution, depending on the right-hand-side vector, and we
  cannot fix it to have a unique solution.
\[ \begin{bmatrix} 1 & 1 & 1 \\ 2 & 2 & 5 \\ 4 & 4 & 8 \end{bmatrix} \implies \begin{bmatrix} 1 & 1 & 1 \\ 0 & 0 & 3 \\ 0 & 0 & 4 \end{bmatrix} \implies \begin{bmatrix} 1 & 1 & 1 \\ 0 & 0 & 3 \\ 0 & 0 & 0 \end{bmatrix} \]
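
The two failure modes can be told apart mechanically. The sketch below is one possible elimination routine, not the book's algorithm: it exchanges rows whenever a zero appears in the pivot position and reports a matrix as singular when fewer pivots than rows are found. The function name `eliminate` and the matrix `fixable` are our own assumptions; `singular_example` is the matrix shown above.

```python
# Sketch: elimination with row exchanges, reporting whether the matrix is singular.
from fractions import Fraction

def eliminate(A):
    """Reduce a square matrix to upper triangular (echelon) form.

    Returns (U, singular), where singular is True when fewer than n pivots exist.
    """
    U = [[Fraction(x) for x in row] for row in A]
    n = len(U)
    row = 0                                  # row in which the next pivot should sit
    for col in range(n):
        # Look for a non-zero entry in this column, at or below the pivot row.
        pivot = next((r for r in range(row, n) if U[r][col] != 0), None)
        if pivot is None:
            continue                         # no pivot in this column
        U[row], U[pivot] = U[pivot], U[row]  # row exchange (no-op if pivot == row)
        for r in range(row + 1, n):
            factor = U[r][col] / U[row][col]
            U[r] = [a - factor * p for a, p in zip(U[r], U[row])]
        row += 1
    return U, row < n

fixable = [[1, 1, 1], [2, 2, 5], [4, 6, 8]]            # hypothetical example
singular_example = [[1, 1, 1], [2, 2, 5], [4, 4, 8]]   # matrix from the text

U_fixable, fixable_is_singular = eliminate(fixable)
U_singular, is_singular = eliminate(singular_example)
print(U_fixable, fixable_is_singular)
print(U_singular, is_singular)
```

For `fixable`, the second row loses its second entry after the first step, and exchanging it with the third row restores a full set of pivots. For `singular_example`, no exchange helps and the last row becomes all zeros.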
2.4 Block Matrices 
ui V1 
Let us write two vectors, u € R™*” and v € R™?t™ as u = and v = , where 
u2 v2 


ui, vı € R™ and ug, v2 € R™. The inner product of these two vectors can then be written as 
T 


ui V1 
ulv= = Ww vı et ul və. 


u2 V2 
We can furthermore express it as matrix multiplication by treating u' and v as a 1 x (nı + n2) matrix 


and an (nı +72) x 1 matrix, respectively: 


Vi 

a 

u v= [uy u7] = [agv + uJ va] . 
v2 


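We can generalize this observation by considering a matrix $A = \begin{bmatrix} A_{11} & A_{12} \end{bmatrix}$ in place of $\mathbf{u}^\top$, where $A_{11}$ and
$A_{12}$ are an $m \times n_1$ matrix and an $m \times n_2$ matrix, respectively. This matrix-vector multiplication, $A\mathbf{v}$, can then
be understood as the sum of two vectors from matrix-vector multiplication, $A_{11}\mathbf{v}_1 + A_{12}\mathbf{v}_2$:

\[
A\mathbf{v} = \begin{bmatrix} A_{11} & A_{12} \end{bmatrix} \begin{bmatrix} \mathbf{v}_1 \\ \mathbf{v}_2 \end{bmatrix} = A_{11}\mathbf{v}_1 + A_{12}\mathbf{v}_2 .
\]

We can further replace $\mathbf{v}$ with an $(n_1 + n_2) \times \ell$ matrix $B = \begin{bmatrix} B_{11} \\ B_{21} \end{bmatrix}$, which results in the following
expression for matrix multiplication between $A$ and $B$:

\[
AB = \begin{bmatrix} A_{11} & A_{12} \end{bmatrix} \begin{bmatrix} B_{11} \\ B_{21} \end{bmatrix} = A_{11}B_{11} + A_{12}B_{21} . \tag{2.2}
\]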


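This procedure applies equally well when the order of the block row and the block column is swapped:¹

\[
\begin{bmatrix} A_{11} \\ A_{21} \end{bmatrix} \begin{bmatrix} B_{11} & B_{12} \end{bmatrix}
= \begin{bmatrix} A_{11}B_{11} & A_{11}B_{12} \\ A_{21}B_{11} & A_{21}B_{12} \end{bmatrix} . \tag{2.3}
\]

¹ Recall that matrix multiplication is not commutative.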


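Example 2.1 When we encounter a matrix representing some data, such a matrix often exhibits a
structure within it. For example, the following matrix is symmetric with diagonal blocks of zero entries,
$A = \begin{bmatrix} 0 & B \\ B^\top & 0 \end{bmatrix}$. Such a structure can be used to facilitate the analysis of data behind the matrix. As
another example, consider the following matrices, which include blocks of heterogeneous data, and their
product.

\[
\begin{bmatrix} 1 & 2 \\ 3 & 4 \\ a & b \end{bmatrix}
\begin{bmatrix} -1 & c & e \\ -2 & d & f \end{bmatrix}
= \begin{bmatrix} -5 & c+2d & e+2f \\ -11 & 3c+4d & 3e+4f \\ -a-2b & ac+bd & ae+bf \end{bmatrix}
\]

By grouping elements of the same type into a block and applying (2.3), we see that the diagonal blocks
correspond to the products of the blocks of the same type, while the off-diagonal ones correspond to the
products of two blocks of two separate types, which provides us with a new perspective on the product of
the two original matrices.

\[
\begin{bmatrix} \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix} \\[4pt] \begin{bmatrix} a & b \end{bmatrix} \end{bmatrix}
\begin{bmatrix} \begin{bmatrix} -1 \\ -2 \end{bmatrix} & \begin{bmatrix} c & e \\ d & f \end{bmatrix} \end{bmatrix}
= \begin{bmatrix}
\begin{bmatrix} -5 \\ -11 \end{bmatrix} & \begin{bmatrix} c+2d & e+2f \\ 3c+4d & 3e+4f \end{bmatrix} \\[4pt]
\begin{bmatrix} -a-2b \end{bmatrix} & \begin{bmatrix} ac+bd & ae+bf \end{bmatrix}
\end{bmatrix}
\]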


We can multiply two block matrices, each of which consists of more than two submatrices, by
recursively applying (2.2) and (2.3) above. Let $A$ be a block matrix consisting of smaller matrices $A_{ij}$ as
follows:

\[
A = \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix},
\]

where the sizes of these sub-matrices are $A_{11} : m_1 \times n_1$, $A_{12} : m_1 \times n_2$, $A_{21} : m_2 \times n_1$, $A_{22} : m_2 \times n_2$, and
$A : (m_1 + m_2) \times (n_1 + n_2)$, respectively, for positive integers $m_i$'s and $n_j$'s. Similarly, let $B$ be a block
matrix consisting of appropriately sized sub-matrices $B_{ij}$:

\[
B = \begin{bmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{bmatrix} .
\]

Then,

\[
AB = \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix} \begin{bmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{bmatrix}
= \begin{bmatrix} A_{11}B_{11} + A_{12}B_{21} & A_{11}B_{12} + A_{12}B_{22} \\ A_{21}B_{11} + A_{22}B_{21} & A_{21}B_{12} + A_{22}B_{22} \end{bmatrix} . \tag{2.4}
\]

Every product $A_{ij}B_{jk}$ must be well-defined for this matrix multiplication to hold.
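A quick numerical check of (2.4) can make the bookkeeping concrete. In the sketch below, the matrices, the split points, and the helper names (`matmul`, `matadd`, `block`, `hstack`) are all our own choices for illustration; the only requirement is that the column split of A matches the row split of B.

```python
def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y))) for j in range(len(Y[0]))]
            for i in range(len(X))]

def matadd(X, Y):
    return [[a + b for a, b in zip(r, s)] for r, s in zip(X, Y)]

def block(M, rows, cols):
    """Sub-matrix of M restricted to the given row and column index ranges."""
    return [[M[i][j] for j in cols] for i in rows]

def hstack(X, Y):
    return [r + s for r, s in zip(X, Y)]

A = [[1, 2, 0], [3, 4, 1], [0, 1, 5]]
B = [[2, 1, 0], [1, 0, 3], [4, 1, 1]]

r1, r2 = range(0, 2), range(2, 3)   # row split of A:    m1 = 2, m2 = 1
c1, c2 = range(0, 1), range(1, 3)   # column split of A = row split of B: n1 = 1, n2 = 2
k1, k2 = range(0, 2), range(2, 3)   # column split of B

A11, A12 = block(A, r1, c1), block(A, r1, c2)
A21, A22 = block(A, r2, c1), block(A, r2, c2)
B11, B12 = block(B, c1, k1), block(B, c1, k2)
B21, B22 = block(B, c2, k1), block(B, c2, k2)

top = hstack(matadd(matmul(A11, B11), matmul(A12, B21)),
             matadd(matmul(A11, B12), matmul(A12, B22)))
bottom = hstack(matadd(matmul(A21, B11), matmul(A22, B21)),
                matadd(matmul(A21, B12), matmul(A22, B22)))
blockwise = top + bottom

print(blockwise == matmul(A, B))  # True
```

Changing the split of B's rows so that it no longer matches the split of A's columns makes the block products ill-defined, which is exactly the well-definedness condition stated above.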




To see the similarity between the block matrix multiplication and usual matrix multiplication, let us
compare (2.4) with the multiplication of two $2 \times 2$ matrices:

\[
\begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix} \begin{bmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \end{bmatrix}
= \begin{bmatrix} a_{11}b_{11} + a_{12}b_{21} & a_{11}b_{12} + a_{12}b_{22} \\ a_{21}b_{11} + a_{22}b_{21} & a_{21}b_{12} + a_{22}b_{22} \end{bmatrix} .
\]

All indices of components in both cases coincide exactly. We have to keep in mind, however, that the
order of blocks in each component of the resulting matrix must be strictly as it is in (2.4): neither
$B_{11}A_{11} + A_{12}B_{21}$ nor $A_{11}B_{11} + B_{21}A_{12}$ can replace $A_{11}B_{11} + A_{12}B_{21}$, due to the lack of commutativity
of matrix multiplication. This is unlike the scalar entries of usual matrix multiplication, where
$a_{11}b_{11} + a_{12}b_{21} = b_{11}a_{11} + a_{12}b_{21} = a_{11}b_{11} + b_{21}a_{12}$.

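For later use, let us consider powers of a block upper triangular matrix. When a block matrix consists
of square diagonal blocks and all components below the diagonal blocks are zeros, we can express the
k-th power of this block matrix in a simple form, as shown in Fact 2.1.

Fact 2.1 Let a square matrix $A$ be $\begin{bmatrix} B & C_1 \\ 0 & D \end{bmatrix}$, where $B$ and $D$ are square matrices. Then,
$A^k = \begin{bmatrix} B^k & C_k \\ 0 & D^k \end{bmatrix}$ for some $C_k$'s.

Proof: $A^2 = \begin{bmatrix} B^2 & BC_1 + C_1D \\ 0 & D^2 \end{bmatrix}$ with $C_2 = BC_1 + C_1D$. If we assume
$A^{k-1} = \begin{bmatrix} B^{k-1} & C_{k-1} \\ 0 & D^{k-1} \end{bmatrix}$, then

\[
A^k = A A^{k-1} = \begin{bmatrix} B^k & BC_{k-1} + C_1D^{k-1} \\ 0 & D^k \end{bmatrix},
\quad\text{where } C_k = BC_{k-1} + C_1D^{k-1} . \qquad \blacksquare
\]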


2.5 Inverse of a Matrix 


For matrix addition, an all-zero matrix is the identity of addition, and a matrix of which each element's
sign is flipped is the inverse of addition. We have also learned of the identity matrix for matrix
multiplication. In this section, we now study the inverse for matrix multiplication. We first define the
inverse of a matrix as follows:

Definition 2.3 Let $A$ be an $m \times n$ matrix. A matrix $B$ is a left-inverse of $A$ if $BA = I_n$, and $C$
is a right-inverse of $A$ if $AC = I_m$. If $B$ is both a left-inverse and a right-inverse of $A$, then we say
that $A$ is invertible and has an inverse.


There are a few interesting observations derived from this definition:

• If $B$ is a $k \times \ell$ matrix, then $n = k$ and $\ell = m$ for $AB$ and $BA$ to be well-defined. That is, $B$ has to
be of size $n \times m$. In fact, it must be $m = n$ if $B$ is an inverse of $A$, as we will show later.

• Inverse is unique if it exists: if both $B$ and $C$ are inverses of $A$, then
$B = BI_m = B(AC) = (BA)C = I_nC = C$.
We denote the inverse of $A$ as $A^{-1}$.

• A useful fact: keep this in mind, as we will use it frequently throughout this book.
$B\mathbf{b}$ is a solution of $A\mathbf{x} = \mathbf{b}$ if $B$ is a right-inverse of $A$, since $A(B\mathbf{b}) = (AB)\mathbf{b} = \mathbf{b}$.

• A caution on using the left-inverse while solving $A\mathbf{x} = \mathbf{b}$: Assume that a left-inverse $B$ of
$A$ exists. Then, for $A\mathbf{x} = \mathbf{b}$, left multiplication of $B$ to both sides results in $B(A\mathbf{x}) = B\mathbf{b}$ and
consequently $\mathbf{x} = I_n\mathbf{x} = (BA)\mathbf{x} = B(A\mathbf{x}) = B\mathbf{b}$. However, for $\mathbf{x} = B\mathbf{b}$, $A\mathbf{x} = A(B\mathbf{b}) = (AB)\mathbf{b}$ may
not reproduce $\mathbf{b}$ unless $B$ is a right-inverse of $A$. This case frequently happens in regression analysis
in statistics, and we are often satisfied with $B\mathbf{b}$ as an approximate solution. A typical example of
a left-inverse which may not be a right-inverse is the pseudoinverse in Fact 5.10, which can be used
to derive an approximate solution to such a regression problem.


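Example 2.2 Consider $A\mathbf{x} = \mathbf{b}$ when $A = \begin{bmatrix} 1 \\ 1 \end{bmatrix}$ and $\mathbf{b} = \begin{bmatrix} 1 \\ 2 \end{bmatrix}$. The $1 \times 2$ matrices
$B = \begin{bmatrix} \tfrac{1}{2} & \tfrac{1}{2} \end{bmatrix}$, $\begin{bmatrix} 1 & 0 \end{bmatrix}$, and $\begin{bmatrix} 0 & 1 \end{bmatrix}$ are left-inverse matrices of $A$, since $BA = [1]$. However,
$B\mathbf{b} = [\tfrac{3}{2}]$, $[1]$, and $[2]$ do not solve $A\mathbf{x} = \mathbf{b}$,² and $A$ has no right-inverse. Among these multiple
left-inverses, it is standard practice in regression analysis to choose the pseudoinverse $\begin{bmatrix} \tfrac{1}{2} & \tfrac{1}{2} \end{bmatrix}$
in Fact 5.10.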
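The example is small enough to check directly. The snippet below is a sketch under the assumption that the data are $A = (1, 1)^\top$ and $\mathbf{b} = (1, 2)^\top$ as read from the example; `matmul` is a plain helper of ours. For each candidate it records whether it is a left-inverse and whether the resulting $B\mathbf{b}$ actually solves the system.

```python
from fractions import Fraction

def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y))) for j in range(len(Y[0]))]
            for i in range(len(X))]

A = [[1], [1]]          # 2 x 1 matrix
b = [[1], [2]]          # right-hand side as a 2 x 1 matrix
left_inverses = [
    [[Fraction(1, 2), Fraction(1, 2)]],   # the pseudoinverse mentioned in the example
    [[1, 0]],
    [[0, 1]],
]

results = []
for B in left_inverses:
    x = matmul(B, b)                      # candidate solution B b (a 1 x 1 matrix)
    is_left_inverse = matmul(B, A) == [[1]]
    solves_system = matmul(A, x) == b
    results.append((is_left_inverse, solves_system))

print(results)  # [(True, False), (True, False), (True, False)]
```

All three candidates satisfy $BA = [1]$, yet none reproduces $\mathbf{b}$, because the two equations $x = 1$ and $x = 2$ are inconsistent.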
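• If $ad - bc \neq 0$, then $\begin{bmatrix} a & b \\ c & d \end{bmatrix}^{-1} = \dfrac{1}{ad - bc} \begin{bmatrix} d & -b \\ -c & a \end{bmatrix}$. Check for yourself that this holds.

• The inverse of a diagonal matrix is also diagonal:
$A = \begin{bmatrix} d_1 & & 0 \\ & \ddots & \\ 0 & & d_n \end{bmatrix} \;\Rightarrow\; A^{-1} = \begin{bmatrix} d_1^{-1} & & 0 \\ & \ddots & \\ 0 & & d_n^{-1} \end{bmatrix}$
if $d_i \neq 0$ for all $i$. If $d_i = 0$ for some $i$, $(AB)_{ij} = 0$ for $j = 1, \ldots, n$ and $AB \neq I_n$ for any $B$. That
is, $A$ is not invertible.

• If both $A$ and $B$ are invertible, then $(AB)^{-1} = B^{-1}A^{-1}$. Check for yourself that $(AB)(B^{-1}A^{-1}) = I$
and $(B^{-1}A^{-1})(AB) = I$.

• $(A^{-1})^\top = (A^\top)^{-1}$, since $A^\top (A^{-1})^\top = (A^{-1}A)^\top = I^\top = I$.

• If $A$ is symmetric and invertible, then $A^{-1}$ is symmetric since $(A^{-1})^\top = (A^\top)^{-1} = A^{-1}$.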
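These identities are easy to spot-check with exact arithmetic. The sketch below uses the $2 \times 2$ formula above; the helper names (`inv2`, `matmul`, `transpose`) and the two test matrices are our own illustrative choices.

```python
from fractions import Fraction

def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y))) for j in range(len(Y[0]))]
            for i in range(len(X))]

def transpose(M):
    return [list(col) for col in zip(*M)]

def inv2(M):
    """Inverse of a 2 x 2 matrix via the ad - bc formula."""
    (a, b), (c, d) = M
    det = Fraction(a * d - b * c)
    if det == 0:
        raise ValueError("ad - bc = 0: the matrix is not invertible")
    return [[d / det, -b / det], [-c / det, a / det]]

A = [[2, 1], [5, 3]]
B = [[1, 2], [3, 4]]
I2 = [[1, 0], [0, 1]]

print(matmul(A, inv2(A)) == I2)                          # True
print(inv2(matmul(A, B)) == matmul(inv2(B), inv2(A)))    # True
print(inv2(transpose(A)) == transpose(inv2(A)))          # True
```

Note the reversed order in the product rule: `matmul(inv2(A), inv2(B))` would generally not equal the inverse of `AB`.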


² In fact, there is no solution that satisfies $A\mathbf{x} = \mathbf{b}$.


There is a simple yet useful observation on the inverse of a triangular matrix. It is particularly useful
to familiarize yourself with the proof techniques behind this result.

Theorem 2.1 Assume $A$ is an upper-triangular matrix. Then, $A$ is invertible if and only if every
diagonal element of $A$ is non-zero. $A^{-1}$ is also upper-triangular if it exists.

Proof: We use mathematical induction on $n$. As the induction hypothesis, we assume that the statement
holds for matrices of size smaller than $n$. We note that this holds for all $1 \times 1$ matrices trivially, since
any $1 \times 1$ matrix is invertible if it is not zero. Let an $n \times n$ upper-triangular matrix be
$A = \begin{bmatrix} A_{n-1} & \mathbf{u} \\ \mathbf{0}_{n-1}^\top & a \end{bmatrix}$,
where $A_{n-1}$ is an $(n-1) \times (n-1)$ upper-triangular matrix, $\mathbf{u} \in \mathbb{R}^{n-1}$, and $a \in \mathbb{R}$. We may use $\mathbf{0}$ instead
of $\mathbf{0}_{n-1}$ for brevity.


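• "only if": Assume that the $n \times n$ upper triangular matrix $A$ is invertible. If $a = 0$, then the last
row of $A$ vanishes, which makes the last row of $AB$ vanish as well regardless of $B$. Because this
contradicts the invertibility of $A$, $a \neq 0$. Let $B = \begin{bmatrix} B_{n-1} & \mathbf{v} \\ \mathbf{w}^\top & b \end{bmatrix}$ be an inverse of $A$. From

\[
AB = \begin{bmatrix} A_{n-1} & \mathbf{u} \\ \mathbf{0}^\top & a \end{bmatrix} \begin{bmatrix} B_{n-1} & \mathbf{v} \\ \mathbf{w}^\top & b \end{bmatrix}
= \begin{bmatrix} A_{n-1}B_{n-1} + \mathbf{u}\mathbf{w}^\top & A_{n-1}\mathbf{v} + \mathbf{u}b \\ a\mathbf{w}^\top & ab \end{bmatrix} = I_n,
\]

we need $a\mathbf{w}^\top = \mathbf{0}^\top$, which implies $\mathbf{w} = \mathbf{0}$ from $a \neq 0$. Then $\mathbf{u}\mathbf{w}^\top = 0_{n-1,n-1}$ and the first block of
$AB$ has to satisfy $A_{n-1}B_{n-1} = I_{n-1}$. On the other hand,

\[
BA = \begin{bmatrix} B_{n-1} & \mathbf{v} \\ \mathbf{0}^\top & b \end{bmatrix} \begin{bmatrix} A_{n-1} & \mathbf{u} \\ \mathbf{0}^\top & a \end{bmatrix}
= \begin{bmatrix} B_{n-1}A_{n-1} & B_{n-1}\mathbf{u} + \mathbf{v}a \\ \mathbf{0}^\top & ba \end{bmatrix} = I_n
\]

also implies $B_{n-1}A_{n-1} = I_{n-1}$. Therefore, $A_{n-1}$ is an invertible upper-triangular matrix of size
$n - 1$, and its diagonal entries should be non-zero by the induction hypothesis. Combining with $a \neq 0$, all
diagonal entries of $A$ are non-zero, and the "only if" statement holds for matrices of size $n$.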


• "if": Assume that $A$ has non-zero diagonal entries. Then, $A_{n-1}$ is invertible by the induction hypothesis
since $A_{n-1}$ is an $(n-1) \times (n-1)$ upper-triangular matrix with non-zero diagonal entries. If
$B = \begin{bmatrix} A_{n-1}^{-1} & -a^{-1}A_{n-1}^{-1}\mathbf{u} \\ \mathbf{0}^\top & a^{-1} \end{bmatrix}$, then $BA = AB = I_n$ and $A^{-1} = B$. That is, $A$ is invertible.

If $A$ is invertible, $A^{-1} = \begin{bmatrix} A_{n-1}^{-1} & -a^{-1}A_{n-1}^{-1}\mathbf{u} \\ \mathbf{0}^\top & a^{-1} \end{bmatrix}$. Since $A_{n-1}^{-1}$ is upper-triangular by induction,
$A^{-1}$ is also upper-triangular. $\blacksquare$

If you recall the relationship between the inverse and transpose of a matrix, you also see that Theorem
2.1 applies equally to lower-triangular matrices.
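The block formula in the "if" part of the proof is itself a recursive algorithm. The following sketch (the function name `upper_inverse` and the test matrix are our own choices) peels off the last row and column, inverts the leading $(n-1) \times (n-1)$ block recursively, and fills in the last column as $-a^{-1}A_{n-1}^{-1}\mathbf{u}$.

```python
from fractions import Fraction

def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y))) for j in range(len(Y[0]))]
            for i in range(len(X))]

def upper_inverse(A):
    """Invert an upper-triangular matrix using the block recursion from the proof."""
    n = len(A)
    a = Fraction(A[-1][-1])
    if a == 0:
        raise ValueError("zero diagonal entry: the matrix is not invertible")
    if n == 1:
        return [[1 / a]]
    top = upper_inverse([row[:-1] for row in A[:-1]])     # inverse of A_{n-1}
    u = [row[-1] for row in A[:-1]]                       # last column above a
    last_col = [-sum(t * x for t, x in zip(trow, u)) / a for trow in top]
    return ([trow + [c] for trow, c in zip(top, last_col)]
            + [[Fraction(0)] * (n - 1) + [1 / a]])

U = [[2, 1, 1], [0, -8, -2], [0, 0, 1]]
U_inv = upper_inverse(U)
identity = [[int(i == j) for j in range(3)] for i in range(3)]

print(matmul(U, U_inv) == identity)                              # True
print(all(U_inv[i][j] == 0 for i in range(3) for j in range(i)))  # True: still upper-triangular
```

A zero on the diagonal triggers the `ValueError`, mirroring the "only if" direction of the theorem.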




2.6 Triangular Factors and LU-Decomposition 


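It was not a coincidence that elementary matrices used in Gaussian elimination, which were multiplied
to the coefficient matrix from the left, were lower-triangular. Let us consider the following case of Gaussian
elimination.

\[
\left[\begin{array}{ccc|c} 2 & 1 & 1 & 5 \\ 4 & -6 & 0 & -2 \\ -2 & 7 & 2 & 9 \end{array}\right]
\xrightarrow{(1)}
\left[\begin{array}{ccc|c} 2 & 1 & 1 & 5 \\ 0 & -8 & -2 & -12 \\ -2 & 7 & 2 & 9 \end{array}\right]
\xrightarrow{(2)}
\left[\begin{array}{ccc|c} 2 & 1 & 1 & 5 \\ 0 & -8 & -2 & -12 \\ 0 & 8 & 3 & 14 \end{array}\right]
\xrightarrow{(3)}
\left[\begin{array}{ccc|c} 2 & 1 & 1 & 5 \\ 0 & -8 & -2 & -12 \\ 0 & 0 & 1 & 2 \end{array}\right]
\]

In this process of elimination, we have used the following three lower-triangular matrices:

\[
\begin{aligned}
&(1):\ (\text{equation 2}) - 2(\text{equation 1}), &
\tilde{L}_1 &= \begin{bmatrix} 1 & 0 & 0 \\ -2 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}, &
L_1 = \tilde{L}_1^{-1} &= \begin{bmatrix} 1 & 0 & 0 \\ 2 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} \\
&(2):\ (\text{equation 3}) + (\text{equation 1}), &
\tilde{L}_2 &= \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 1 & 0 & 1 \end{bmatrix}, &
L_2 = \tilde{L}_2^{-1} &= \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ -1 & 0 & 1 \end{bmatrix} \\
&(3):\ (\text{equation 3}) + (\text{equation 2}), &
\tilde{L}_3 &= \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 1 & 1 \end{bmatrix}, &
L_3 = \tilde{L}_3^{-1} &= \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & -1 & 1 \end{bmatrix}
\end{aligned}
\]

We obtain a single lower-triangular matrix that represents Gaussian elimination by multiplying these
lower-triangular matrices successively. Check for yourself that multiplying multiple lower-triangular
matrices results in a lower-triangular matrix. In Gaussian elimination, we alter the i-th row by adding a
linear combination of the upper rows, i.e. the first to (i − 1)-th rows, to the i-th row itself. All the
elementary matrices that correspond to these changes and their product then result in lower-triangular
matrices whose diagonal entries are all 1's. In this particular example, the resulting lower-triangular
matrix and its inverse are

\[
\tilde{L} = \tilde{L}_3\tilde{L}_2\tilde{L}_1 = \begin{bmatrix} 1 & 0 & 0 \\ -2 & 1 & 0 \\ -1 & 1 & 1 \end{bmatrix},
\qquad
L = \tilde{L}^{-1} = \tilde{L}_1^{-1}\tilde{L}_2^{-1}\tilde{L}_3^{-1} = L_1L_2L_3 = \begin{bmatrix} 1 & 0 & 0 \\ 2 & 1 & 0 \\ -1 & -1 & 1 \end{bmatrix} .
\]

Gaussian elimination turns the coefficient matrix into an upper-triangular matrix, and in this example,
this matrix is

\[
\tilde{L}A = \begin{bmatrix} 1 & 0 & 0 \\ -2 & 1 & 0 \\ -1 & 1 & 1 \end{bmatrix} \begin{bmatrix} 2 & 1 & 1 \\ 4 & -6 & 0 \\ -2 & 7 & 2 \end{bmatrix}
= \begin{bmatrix} 2 & 1 & 1 \\ 0 & -8 & -2 \\ 0 & 0 & 1 \end{bmatrix} = U .
\]

After multiplying both sides by the inverse of the lower-triangular matrix, we get

\[
\begin{bmatrix} 2 & 1 & 1 \\ 4 & -6 & 0 \\ -2 & 7 & 2 \end{bmatrix}
= \begin{bmatrix} 1 & 0 & 0 \\ -2 & 1 & 0 \\ -1 & 1 & 1 \end{bmatrix}^{-1} \begin{bmatrix} 2 & 1 & 1 \\ 0 & -8 & -2 \\ 0 & 0 & 1 \end{bmatrix}
= \begin{bmatrix} 1 & 0 & 0 \\ 2 & 1 & 0 \\ -1 & -1 & 1 \end{bmatrix} \begin{bmatrix} 2 & 1 & 1 \\ 0 & -8 & -2 \\ 0 & 0 & 1 \end{bmatrix}
= LU,
\]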


resulting in the product of the lower-triangular matrix $L$ and the upper-triangular matrix $U$. We call
this the LU-decomposition. The upper triangular matrix can be further decomposed as $DU$ with an
invertible diagonal matrix $D$ and another upper triangular matrix (again denoted $U$) whose first non-zero
entry in each row is 1, and we have $A = LDU$. It is called the LDU-decomposition of the matrix $A$. The
matrix in the above example has the LDU-decomposition

\[
\begin{bmatrix} 2 & 1 & 1 \\ 4 & -6 & 0 \\ -2 & 7 & 2 \end{bmatrix}
= \begin{bmatrix} 1 & 0 & 0 \\ 2 & 1 & 0 \\ -1 & -1 & 1 \end{bmatrix}
\begin{bmatrix} 2 & 0 & 0 \\ 0 & -8 & 0 \\ 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix} 1 & 1/2 & 1/2 \\ 0 & 1 & 1/4 \\ 0 & 0 & 1 \end{bmatrix} .
\]
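The factors above can be reproduced by recording each elimination multiplier as it is used. The sketch below is one common way to organize this (often called the Doolittle scheme), not necessarily how the authors compute it; it assumes that no row exchange is needed and that all pivots are non-zero, and the function name `lu` is ours.

```python
from fractions import Fraction

def lu(A):
    """Gaussian elimination without row exchanges; returns (L, U) with A = L U."""
    n = len(A)
    U = [[Fraction(x) for x in row] for row in A]
    L = [[Fraction(int(i == j)) for j in range(n)] for i in range(n)]
    for k in range(n):
        if U[k][k] == 0:
            raise ValueError("zero pivot: a row exchange would be needed")
        for i in range(k + 1, n):
            L[i][k] = U[i][k] / U[k][k]                 # multiplier goes into L
            U[i] = [x - L[i][k] * y for x, y in zip(U[i], U[k])]
    return L, U

A = [[2, 1, 1], [4, -6, 0], [-2, 7, 2]]
L, U = lu(A)

# Split U further into D (its diagonal) and a unit upper-triangular factor
D = [[U[i][i] if i == j else Fraction(0) for j in range(3)] for i in range(3)]
U_unit = [[U[i][j] / U[i][i] for j in range(3)] for i in range(3)]

print(L == [[1, 0, 0], [2, 1, 0], [-1, -1, 1]])    # True
print(U == [[2, 1, 1], [0, -8, -2], [0, 0, 1]])    # True
```

The entries of `L` below the diagonal are exactly the multipliers of steps (1) to (3) with their signs flipped, which is why no explicit matrix inversion is needed to obtain it.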


Unlike in LU-decomposition, the upper triangular matrix U in LDU-decomposition may not be invertible. 

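With LU-decomposition, we can solve the corresponding system of linear equations for any $\mathbf{b}$ by
substitution as in

\[
\begin{aligned}
A\mathbf{x} = \mathbf{b} \;&\Rightarrow\; LU\mathbf{x} = \mathbf{b}, \ \text{or}\ L\mathbf{y} = \mathbf{b} \ \text{with}\ \mathbf{y} = U\mathbf{x} \\
&\Rightarrow\; \mathbf{y} = L^{-1}\mathbf{b} \\
&\Rightarrow\; U\mathbf{x} = \mathbf{y} = L^{-1}\mathbf{b} \\
&\Rightarrow\; \mathbf{x} = U^{-1}L^{-1}\mathbf{b} .
\end{aligned}
\]

As we mentioned before, linear systems with upper or lower triangular coefficient matrices can be
efficiently solved by back-substitution (from the last row upward for $U$) or by the analogous forward
substitution (from the first row downward for $L$).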
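The two-stage solve can be sketched as follows, using the factors $L$ and $U$ of the running example and the right-hand side $\mathbf{b} = (5, -2, 9)^\top$ from the augmented matrix at the start of this section. The function names are ours, and non-zero diagonal entries are assumed.

```python
from fractions import Fraction

def forward_substitute(L, b):
    """Solve L y = b for a lower-triangular L, from the first row downward."""
    y = []
    for i, row in enumerate(L):
        known = sum(row[j] * y[j] for j in range(i))
        y.append((Fraction(b[i]) - known) / row[i])
    return y

def back_substitute(U, y):
    """Solve U x = y for an upper-triangular U, from the last row upward."""
    n = len(U)
    x = [Fraction(0)] * n
    for i in range(n - 1, -1, -1):
        known = sum(U[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (y[i] - known) / U[i][i]
    return x

L = [[1, 0, 0], [2, 1, 0], [-1, -1, 1]]
U = [[2, 1, 1], [0, -8, -2], [0, 0, 1]]
b = [5, -2, 9]

y = forward_substitute(L, b)   # intermediate vector y = L^{-1} b
x = back_substitute(U, y)      # solution x = U^{-1} y
print(y, x)
```

Once $L$ and $U$ are known, each new right-hand side costs only these two triangular solves, with no further elimination.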

We can use LU-decomposition to compute the inverse of the coefficient matrix $A$. If $U$ is invertible,
$A^{-1} = U^{-1}L^{-1}$. We can thus perform Gaussian elimination on an $n \times 2n$ expanded coefficient matrix
$[A \mid I]$, so that the first half results in an identity matrix, which transforms the latter half (the identity
matrix) into the inverse of $A$:

\[
U^{-1}L^{-1}[A \mid I] = U^{-1}[U \mid L^{-1}] = [I \mid U^{-1}L^{-1}] = [I \mid A^{-1}] .
\]

As an example, consider computing the following inverse:

\[
\begin{bmatrix} 2 & 1 & 1 \\ 4 & -6 & 0 \\ -2 & 7 & 2 \end{bmatrix}^{-1} .
\]

First, we augment the original coefficient matrix $A$ by attaching an identity matrix, resulting in

\[
\left[\begin{array}{ccc|ccc} 2 & 1 & 1 & 1 & 0 & 0 \\ 4 & -6 & 0 & 0 & 1 & 0 \\ -2 & 7 & 2 & 0 & 0 & 1 \end{array}\right] .
\]
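We then multiply this augmented matrix with the lower-triangular matrix $\tilde{L} = L^{-1}$ from Gaussian elimination,
which transforms the original coefficient matrix $A$ into the upper-triangular matrix $U$. We continue by
multiplying this matrix again from the left with $U^{-1}$, carried out in steps: elimination above the pivots
first makes the coefficient matrix into a diagonal matrix. Finally, we turn it into an identity matrix by
multiplying it with the inverse of this diagonal matrix.

\[
\begin{aligned}
&\begin{bmatrix} 1 & 0 & 0 \\ -2 & 1 & 0 \\ -1 & 1 & 1 \end{bmatrix}
\left[\begin{array}{ccc|ccc} 2 & 1 & 1 & 1 & 0 & 0 \\ 4 & -6 & 0 & 0 & 1 & 0 \\ -2 & 7 & 2 & 0 & 0 & 1 \end{array}\right]
= \left[\begin{array}{ccc|ccc} 2 & 1 & 1 & 1 & 0 & 0 \\ 0 & -8 & -2 & -2 & 1 & 0 \\ 0 & 0 & 1 & -1 & 1 & 1 \end{array}\right] \\[4pt]
\Rightarrow\ &\begin{bmatrix} 1 & 1/8 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}
\left[\begin{array}{ccc|ccc} 2 & 1 & 1 & 1 & 0 & 0 \\ 0 & -8 & -2 & -2 & 1 & 0 \\ 0 & 0 & 1 & -1 & 1 & 1 \end{array}\right]
= \left[\begin{array}{ccc|ccc} 2 & 0 & 3/4 & 3/4 & 1/8 & 0 \\ 0 & -8 & -2 & -2 & 1 & 0 \\ 0 & 0 & 1 & -1 & 1 & 1 \end{array}\right] \\[4pt]
\Rightarrow\ &\begin{bmatrix} 1 & 0 & -3/4 \\ 0 & 1 & 2 \\ 0 & 0 & 1 \end{bmatrix}
\left[\begin{array}{ccc|ccc} 2 & 0 & 3/4 & 3/4 & 1/8 & 0 \\ 0 & -8 & -2 & -2 & 1 & 0 \\ 0 & 0 & 1 & -1 & 1 & 1 \end{array}\right]
= \left[\begin{array}{ccc|ccc} 2 & 0 & 0 & 3/2 & -5/8 & -3/4 \\ 0 & -8 & 0 & -4 & 3 & 2 \\ 0 & 0 & 1 & -1 & 1 & 1 \end{array}\right] \\[4pt]
\Rightarrow\ &\begin{bmatrix} 1/2 & 0 & 0 \\ 0 & -1/8 & 0 \\ 0 & 0 & 1 \end{bmatrix}
\left[\begin{array}{ccc|ccc} 2 & 0 & 0 & 3/2 & -5/8 & -3/4 \\ 0 & -8 & 0 & -4 & 3 & 2 \\ 0 & 0 & 1 & -1 & 1 & 1 \end{array}\right]
= \left[\begin{array}{ccc|ccc} 1 & 0 & 0 & 3/4 & -5/16 & -3/8 \\ 0 & 1 & 0 & 1/2 & -3/8 & -1/4 \\ 0 & 0 & 1 & -1 & 1 & 1 \end{array}\right]
\end{aligned}
\]

The right half of this matrix, which started out as the identity matrix attached to the original matrix, is
equal to the product of all the matrices that were multiplied from the left. Because these multiplications
transformed the coefficient matrix $A$ into the identity matrix, this right half represents
the inverse of the coefficient matrix. That is,

\[
A^{-1} = \begin{bmatrix} 3/4 & -5/16 & -3/8 \\ 1/2 & -3/8 & -1/4 \\ -1 & 1 & 1 \end{bmatrix} .
\]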
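The same computation can be sketched as code acting on the augmented matrix $[A \mid I]$. This is a generic Gauss-Jordan routine of our own (the name `inverse` is hypothetical); it scales each pivot row first and clears the whole column in one pass, so the intermediate matrices differ from the four steps shown above even though the final result agrees.

```python
from fractions import Fraction

def inverse(A):
    """Invert a square matrix by Gauss-Jordan elimination on [A | I]."""
    n = len(A)
    M = [[Fraction(x) for x in row] + [Fraction(int(i == j)) for j in range(n)]
         for i, row in enumerate(A)]
    for col in range(n):
        pivot = next((r for r in range(col, n) if M[r][col] != 0), None)
        if pivot is None:
            raise ValueError("matrix is singular")
        M[col], M[pivot] = M[pivot], M[col]       # row exchange if needed
        p = M[col][col]
        M[col] = [x / p for x in M[col]]          # scale the pivot row so the pivot is 1
        for r in range(n):
            if r != col:                          # clear the column above and below
                f = M[r][col]
                M[r] = [x - f * y for x, y in zip(M[r], M[col])]
    return [row[n:] for row in M]                 # right half is the inverse

A = [[2, 1, 1], [4, -6, 0], [-2, 7, 2]]
A_inv = inverse(A)
for row in A_inv:
    print([str(x) for x in row])
```

Printing the rows as strings shows the fractions $3/4, -5/16, -3/8$ and so on, matching the matrix above.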


In short, this whole process can be expressed succinctly as

\[
A^{-1} = (LDU)^{-1} = U^{-1}D^{-1}L^{-1},
\]

although it requires Gaussian elimination for us to eventually obtain $L$, $D$ and $U$.


Uniqueness of LU-Decomposition 


Suppose a matrix $A$ can be decomposed as $A = LDU$, where the triangular matrices have unit diagonal
entries and the diagonal matrix has non-zero diagonal entries. Then, we can show that this decomposition
is unique. Consider two possible ways to factorize a matrix: $L_1D_1U_1 = L_2D_2U_2$. If we move $L_2$ and $U_1$ to
the left-hand side and the right-hand side, respectively, we get $L_2^{-1}L_1D_1 = D_2U_2U_1^{-1}$. We can further move
$D_1$ to the right-hand side to obtain $L_2^{-1}L_1 = D_2U_2U_1^{-1}D_1^{-1}$. According to Theorem 2.1, $L_2^{-1}L_1$ and
$U_2U_1^{-1}$ are respectively lower- and upper-triangular matrices. Furthermore, all the diagonal entries of
$L_2^{-1}L_1$ are 1's. The left-hand side of $L_2^{-1}L_1 = D_2U_2U_1^{-1}D_1^{-1}$ is therefore lower-triangular with unit
diagonal entries, while the right-hand side is upper-triangular, because triangular matrices continue to
be triangular even when multiplied with diagonal matrices. A matrix that is both lower- and
upper-triangular is diagonal, so both sides must be identity matrices.
From $L_2^{-1}L_1 = I$, we arrive at $L_1 = L_2$. From $D_2U_2U_1^{-1}D_1^{-1} = I$, we get $U_2U_1^{-1} = D_2^{-1}D_1$; since the
left-hand side has unit diagonal entries, $D_1 = D_2$ and then $U_1 = U_2$.

Let us further assume that $A$ is a symmetric matrix and factorized into $A = LDU$ without any
row swaps. Due to the symmetry of $A$, $LDU = U^\top DL^\top$ holds. According to the uniqueness of the
decomposition above, $U = L^\top$, meaning that we can factorize $A$ as $A = LDL^\top$.


LU-decomposition with Row Exchanges 


We can perform Gaussian elimination on a matrix, such as the one below, that would not admit Gaussian
elimination in its original form:

\[
A = \begin{bmatrix} 0 & 2 \\ 4 & 5 \end{bmatrix} .
\]

We do so by multiplying $A$ from the left with the following matrix, which results in swapping the first and
second rows:³

\[
Q = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix} .
\]

We get such a matrix by (repeatedly) swapping two rows of the identity matrix and call it a permutation
matrix. Each row and each column of a permutation matrix has exactly one 1, and all the other entries
are 0's. (Refer to Appendix B for the details of permutation matrices.) This results in the following
matrix $QA$:

\[
QA = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix} \begin{bmatrix} 0 & 2 \\ 4 & 5 \end{bmatrix} = \begin{bmatrix} 4 & 5 \\ 0 & 2 \end{bmatrix},
\]

which then can be factorized into $QA = LU$ using LU-decomposition.
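A factorization with row exchanges can be sketched by tracking the exchanges alongside the elimination. The routine below is our own illustration (the name `lu_with_exchanges` is hypothetical): it swaps in the first available non-zero pivot, applies the same swap to the multipliers already stored, and returns a permutation matrix `Q` with `QA = LU`. The $3 \times 3$ test matrix is our own example; the $2 \times 2$ one is the matrix as read from the text.

```python
from fractions import Fraction

def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y))) for j in range(len(Y[0]))]
            for i in range(len(X))]

def lu_with_exchanges(A):
    """Return (Q, L, U) with Q a permutation matrix and Q A = L U."""
    n = len(A)
    U = [[Fraction(x) for x in row] for row in A]
    L = [[Fraction(0)] * n for _ in range(n)]
    perm = list(range(n))                         # perm[i] = original index of row i
    for k in range(n):
        pivot = next((r for r in range(k, n) if U[r][k] != 0), None)
        if pivot is None:
            continue                              # column already zero below the diagonal
        U[k], U[pivot] = U[pivot], U[k]
        L[k], L[pivot] = L[pivot], L[k]           # carry stored multipliers along
        perm[k], perm[pivot] = perm[pivot], perm[k]
        for i in range(k + 1, n):
            L[i][k] = U[i][k] / U[k][k]
            U[i] = [x - L[i][k] * y for x, y in zip(U[i], U[k])]
    for i in range(n):
        L[i][i] = Fraction(1)
    Q = [[int(perm[i] == j) for j in range(n)] for i in range(n)]
    return Q, L, U

A = [[0, 2], [4, 5]]
Q, L, U = lu_with_exchanges(A)
print(Q, matmul(Q, A) == matmul(L, U))

A3 = [[1, 1, 1], [2, 2, 5], [4, 6, 8]]            # hypothetical: exchange needed at step 2
Q3, L3, U3 = lu_with_exchanges(A3)
print(matmul(Q3, A3) == matmul(L3, U3))
```

Production libraries usually pick the largest available pivot rather than the first non-zero one, for numerical stability; with exact fractions that refinement is not needed.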


2.7 Inverse of a Block Matrix 


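Let us consider the following $2 \times 2$ matrix $A$ with non-zero $a_{11}$:

\[
A = \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix} .
\]

³ Multiplying by the permutation matrix $Q$ from the right instead swaps the first and second columns.

To eliminate $a_{21}$, we multiply an elementary matrix to the left of $A$ as

\[
\begin{bmatrix} 1 & 0 \\ -a_{21}a_{11}^{-1} & 1 \end{bmatrix} \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix}
= \begin{bmatrix} a_{11} & a_{12} \\ 0 & a_{22} - a_{21}a_{11}^{-1}a_{12} \end{bmatrix} .
\]

Then, to convert $a_{11}$ into 1, we scale the first row by multiplying a diagonal matrix to the left as

\[
\begin{bmatrix} a_{11}^{-1} & 0 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} a_{11} & a_{12} \\ 0 & a_{22} - a_{21}a_{11}^{-1}a_{12} \end{bmatrix}
= \begin{bmatrix} 1 & a_{11}^{-1}a_{12} \\ 0 & a_{22} - a_{21}a_{11}^{-1}a_{12} \end{bmatrix} .
\]

These two operations are achieved at once by multiplying with the following matrix:

\[
\begin{bmatrix} a_{11}^{-1} & 0 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 \\ -a_{21}a_{11}^{-1} & 1 \end{bmatrix}
= \begin{bmatrix} a_{11}^{-1} & 0 \\ -a_{21}a_{11}^{-1} & 1 \end{bmatrix} .
\]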

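This matrix representation of Gaussian elimination can be extended to block matrices.

Let $A$ be a square matrix that can be represented as a block matrix as follows:

\[
A = \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix},
\]

where $A_{11}$ and $A_{22}$ are also square matrices. If $A_{11}$ is invertible, we can eliminate $A_{21}$ using Gaussian
elimination. We can illustrate this process by matrix multiplication:

\[
\begin{bmatrix} A_{11}^{-1} & 0 \\ -A_{21}A_{11}^{-1} & I_{22} \end{bmatrix} \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix}
= \begin{bmatrix} I_{11} & A_{11}^{-1}A_{12} \\ 0 & A_{22} - A_{21}A_{11}^{-1}A_{12} \end{bmatrix} . \tag{2.5}
\]

Here $I_{11}$ is an identity matrix of the same size as $A_{11}$, and $I_{22}$ is one of the same size as $A_{22}$. Similarly,
we can eliminate $A_{12}$ by Gaussian elimination if $A_{22}$ is invertible, which can be expressed as

\[
\begin{bmatrix} I_{11} & -A_{12}A_{22}^{-1} \\ 0 & A_{22}^{-1} \end{bmatrix} \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix}
= \begin{bmatrix} A_{11} - A_{12}A_{22}^{-1}A_{21} & 0 \\ A_{22}^{-1}A_{21} & I_{22} \end{bmatrix} . \tag{2.6}
\]

Let $S_{22} = A_{22} - A_{21}A_{11}^{-1}A_{12}$ to simplify the right-hand side of (2.5). We call $S_{22}$ a Schur complement
of $A_{11}$ with respect to $A$.⁴ The right-hand side of (2.5) simplifies to
$\begin{bmatrix} I_{11} & A_{11}^{-1}A_{12} \\ 0 & S_{22} \end{bmatrix}$, and with invertible
$S_{22}$ we can perform Gaussian elimination further as follows:⁵

\[
\begin{bmatrix} I_{11} & -A_{11}^{-1}A_{12}S_{22}^{-1} \\ 0 & S_{22}^{-1} \end{bmatrix} \begin{bmatrix} I_{11} & A_{11}^{-1}A_{12} \\ 0 & S_{22} \end{bmatrix}
= \begin{bmatrix} I_{11} & 0 \\ 0 & I_{22} \end{bmatrix} .
\]

By plugging in (2.5), we get

\[
\begin{bmatrix} I_{11} & -A_{11}^{-1}A_{12}S_{22}^{-1} \\ 0 & S_{22}^{-1} \end{bmatrix}
\begin{bmatrix} A_{11}^{-1} & 0 \\ -A_{21}A_{11}^{-1} & I_{22} \end{bmatrix}
\begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix}
= \begin{bmatrix} I_{11} & 0 \\ 0 & I_{22} \end{bmatrix},
\]

from which we observe that the inverse of the original coefficient matrix $A$ is the product of the two left
matrices:

\[
\begin{aligned}
\begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix}^{-1}
&= \begin{bmatrix} I_{11} & -A_{11}^{-1}A_{12}S_{22}^{-1} \\ 0 & S_{22}^{-1} \end{bmatrix}
\begin{bmatrix} A_{11}^{-1} & 0 \\ -A_{21}A_{11}^{-1} & I_{22} \end{bmatrix} \\
&= \begin{bmatrix} A_{11}^{-1} + A_{11}^{-1}A_{12}S_{22}^{-1}A_{21}A_{11}^{-1} & -A_{11}^{-1}A_{12}S_{22}^{-1} \\ -S_{22}^{-1}A_{21}A_{11}^{-1} & S_{22}^{-1} \end{bmatrix} .
\end{aligned} \tag{2.7}
\]

As we have demonstrated, two equations (2.5) and (2.6), arising from Gaussian elimination, are useful
for performing thought experiments on various types of matrices in the form of block matrices.

⁴ Similarly, with an invertible $A_{22}$, a Schur complement of $A_{22}$ with respect to $A$ is $S_{11} = A_{11} - A_{12}A_{22}^{-1}A_{21}$.
⁵ In other words, perform the following replacements: $A_{11} \leftarrow I_{11}$, $A_{12} \leftarrow A_{11}^{-1}A_{12}$, $A_{21} \leftarrow 0$, $A_{22} \leftarrow S_{22}$, and apply (2.6).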
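Formula (2.7) can be sanity-checked on a small case. In the sketch below, a $4 \times 4$ matrix is assembled from four $2 \times 2$ blocks of our own choosing (with $A_{11}$ and the resulting $S_{22}$ both invertible), the inverse is built block by block from (2.7), and the result is multiplied back against the original matrix. All helper names are ours.

```python
from fractions import Fraction

def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y))) for j in range(len(Y[0]))]
            for i in range(len(X))]

def add(X, Y, sign=1):
    return [[a + sign * b for a, b in zip(r, s)] for r, s in zip(X, Y)]

def neg(X):
    return [[-a for a in r] for r in X]

def inv2(M):
    """Inverse of a 2 x 2 block via the ad - bc formula (fails if ad - bc = 0)."""
    (a, b), (c, d) = M
    det = Fraction(a * d - b * c)
    return [[d / det, -b / det], [-c / det, a / det]]

def assemble(X11, X12, X21, X22):
    return [r + s for r, s in zip(X11, X12)] + [r + s for r, s in zip(X21, X22)]

A11, A12 = [[2, 1], [1, 1]], [[1, 0], [0, 1]]
A21, A22 = [[0, 1], [1, 0]], [[3, 1], [1, 2]]

A11_inv = inv2(A11)
S22 = add(A22, matmul(matmul(A21, A11_inv), A12), sign=-1)   # Schur complement
S22_inv = inv2(S22)

top_left = add(A11_inv, matmul(matmul(matmul(A11_inv, A12), S22_inv),
                               matmul(A21, A11_inv)))
top_right = neg(matmul(matmul(A11_inv, A12), S22_inv))
bottom_left = neg(matmul(matmul(S22_inv, A21), A11_inv))

A = assemble(A11, A12, A21, A22)
A_inv = assemble(top_left, top_right, bottom_left, S22_inv)
identity = [[int(i == j) for j in range(4)] for i in range(4)]
print(matmul(A, A_inv) == identity)   # True
```

Only two $2 \times 2$ inversions are needed for the $4 \times 4$ inverse, which is the practical appeal of working with blocks.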


2.8 Application to Data Science: Graphs and Matrices 


We can mathematically express the relationships among multiple objects, for instance in a social
network or an engineering system, as a graph or network. We call each $v_i$ in the figure below a node;
a node can be any object that can have a relationship with other objects, such as a person, an
organization, a machine, or a computer. When two nodes are related to each other, we connect them with
a line and call this line an edge. A graph is intuitive to visualize but challenging to manipulate.
We thus express a graph as a matrix in order to compute its properties and perform various manipulations
on it.

With n nodes in a graph, we create an $n \times n$ matrix. Each row/column of this matrix corresponds
to one node in the graph. If there is an edge between the i-th and j-th nodes, the $(i, j)$-th element
of the matrix takes the value 1. Otherwise, it is set to 0. Because we consider edges without
directionality, this matrix, called an adjacency matrix, is symmetric.


As an example, let us convert the graph above into the adjacency matrix:

[Adjacency matrix $G$ of the graph above: an $8 \times 8$ symmetric matrix of 0's and 1's.]

By reordering the nodes to be $v_3, v_4, v_5, v_8, v_1, v_2, v_6, v_7$, we get the following block matrix as an
adjacency matrix:

\[
\begin{bmatrix} 0 & B \\ B^\top & 0 \end{bmatrix}, \quad \text{where $B$ is a $4 \times 4$ block of 0's and 1's.}
\]

By examining this adjacency matrix, we observe that there are two groups of nodes, $\{v_3, v_4, v_5, v_8\}$
and $\{v_1, v_2, v_6, v_7\}$. There is no edge between nodes within each of these groups, but there are edges
that connect nodes from these two groups. We call such a graph a bipartite graph. This is a simple
demonstration of how we read out various properties of a graph by analyzing the graph's adjacency
matrix.
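The reordering argument can be tried out on a toy graph. The six-node edge list below is hypothetical (it is not the graph from the figure), and the helper names are ours; the point is only to show how an adjacency matrix is built and how a suitable node order exposes the zero diagonal blocks of a bipartite graph.

```python
def adjacency(n, edges):
    """Symmetric 0/1 adjacency matrix of an undirected graph on nodes 0..n-1."""
    G = [[0] * n for _ in range(n)]
    for i, j in edges:
        G[i][j] = G[j][i] = 1
    return G

def reorder(G, order):
    """Adjacency matrix after listing the nodes in the given order."""
    return [[G[i][j] for j in order] for i in order]

# Hypothetical bipartite graph: every edge joins {0, 1, 5} with {2, 3, 4}
edges = [(0, 2), (0, 3), (1, 2), (1, 4), (5, 3), (5, 4)]
G = adjacency(6, edges)

P = reorder(G, [0, 1, 5, 2, 3, 4])           # first group, then second group
top_left = [row[:3] for row in P[:3]]
bottom_right = [row[3:] for row in P[3:]]
B = [row[3:] for row in P[:3]]               # off-diagonal block
B_T = [row[:3] for row in P[3:]]

zero = [[0] * 3 for _ in range(3)]
print(top_left == zero and bottom_right == zero)   # True: no edges inside either group
print(B_T == [list(col) for col in zip(*B)])       # True: lower-left block is B transposed
```

In the original node order the same matrix `G` shows no obvious pattern; the block structure only becomes visible after the reordering.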

Such analysis of an adjacency matrix is widely used in various disciplines, including applied math- 
ematics, engineering and social sciences. In doing so, it is a usual and helpful practice to express and 
analyze an adjacency matrix as a block matrix. 

When a relationship between two nodes is directed, it is usual to name each edge, which is often
referred to as an arc, directly and distinctly, as shown in the figure below.

When edges are directed, we have a directed graph, which we often also refer to as a network. Such
a directed graph can be expressed as an incidence matrix. Each row of the incidence matrix corresponds
to a node, and each column to an arc. We set $a_{ij}$ to 1 if the j-th arc starts at the i-th node, and
to −1 if the j-th arc terminates at the i-th node. Each column of any incidence matrix thereby has two
non-zero elements: one 1 and one −1. An incidence matrix $A$ corresponding to the directed graph above
is then

[Incidence matrix $A$ of the directed graph above: one row per node and one column per arc, with entries 1, −1, and 0.]

Consider an example where this directed graph represents a network of airports, and each arc is
associated with $x_{ij}$ that represents the number of flights from the i-th airport to the j-th airport each
day. $\mathbf{x} = (x_{ij})$ is then a vector representing all the flights in the sky each day. $A\mathbf{x}$ in turn represents
the differences between the incoming and outgoing flights at all airports. For instance, let $\mathbf{b}$ satisfy
$b_1 > 0$, $b_2 > 0$, $b_8 = -(b_1 + b_2) < 0$, $b_3 = \cdots = b_7 = 0$. When $\mathbf{x}$ satisfies $A\mathbf{x} = \mathbf{b}$, $\mathbf{x}$ corresponds to having
all flights from the first two nodes, $v_1$ and $v_2$, eventually fly out to the final node $v_8$ without any loss of
the flights in-between. Of course, we want to constrain $x_{ij}$ to be non-negative in practice. This is an
example of using an incidence matrix to express the network flow and to represent the conservation
of the flow, i.e. $A\mathbf{x} = \mathbf{b}$.
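The conservation reading of $A\mathbf{x} = \mathbf{b}$ can be illustrated on a miniature network. The four-node arc list and the flow values below are hypothetical (they are not the airports of the figure), and the helper names are ours; the sign convention follows the text, with 1 at the start of an arc and −1 at its end.

```python
def incidence(n_nodes, arcs):
    """Node-by-arc incidence matrix: +1 where an arc starts, -1 where it ends."""
    A = [[0] * len(arcs) for _ in range(n_nodes)]
    for j, (start, end) in enumerate(arcs):
        A[start][j] = 1
        A[end][j] = -1
    return A

def matvec(A, x):
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]

# Hypothetical network: two sources (0, 1) feed a hub (2), which feeds a sink (3)
arcs = [(0, 2), (1, 2), (2, 3)]
A = incidence(4, arcs)
x = [3, 2, 5]                 # flow on each arc

b = matvec(A, x)
print(b)                      # [3, 2, 0, -5]
```

The two sources have positive entries, the hub has a zero entry because everything that arrives also leaves, and the sink absorbs the total, so the entries of $\mathbf{b}$ sum to zero.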


Chapter 3 


Vector Spaces 


Imagine a familiar physical quantity of force, which most of us have learned about earlier in our education. 
Given an object, we can apply force to manipulate this object. We can also apply two different forces to 
the object simultaneously, which would be equivalent to applying the sum of two forces, or the addition 
of these forces, to the object once. We can apply the same force twice to the object, which would be 
equivalent to applying double the force to the object. This thought experiment hints at a space of forces 
that can be added with each other and multiplied with a scalar. In fact, this is how we define a vector 
space in this chapter. A vector space is a set of things, called vectors, and these vectors can be added 
to each other and multiplied by a scalar. In this space, the distributive rule holds between addition and 
scalar multiplication. 

Many things can be vectors, and naturally we can build many different vector spaces, such as a 
collection of points on a plane, a collection of real matrices of the same size, a collection of quadratic 
polynomials with real coefficients, a collection of random variables on a sample space and more. When we 
combine vector addition and scalar multiplication into one operation, we call it linear combination. With 
linear combination, we can ask mathematically interesting questions about a given vector space. Is a finite
set of vectors minimal, in the sense that no vector in the set is a linear combination of the other vectors? Is a
finite number of vectors enough to represent every vector in the vector space as a linear combination?
The answers to these two questions lead us to the concept of a basis, which is defined as a minimal set of 
vectors representing a whole vector space. From this definition of the basis, we can define the dimension 
of a vector space as the number of vectors in a basis. This allows us to compare two vector spaces, as 
two vector spaces of the same dimension are (roughly) equivalent. 

By introducing a function between vector spaces called a linear transformation, we can tell more 
interesting stories. We soon see that a linear transformation corresponds to a matrix in a one-to-one 
fashion. Therefore, studying matrices equally extends our understanding of linear transformations. For 
a linear transformation described in terms of a matrix, the range of the transformation is defined as 


the column space of the matrix, and the kernel of the transformation is the null space of the matrix. 


Kang and Cho, Linear Algebra for Data Science, 25 
©2024. (Wanmo Kang, Kyunghyun Cho) all rights reserved. 




Characterization of these spaces is a by-product of Gaussian elimination of the matrix. We refine the 
Gaussian elimination further to obtain the so-called row echelon form, whose pattern of non-zero elements 
is essential to finding the column and null spaces as well as the rank of the matrix. Many important 
observations on matrices and vector spaces are related to the rank. We also briefly look at special 
structures of matrices corresponding to geometric transformations like a rotation, a reflection, and a 


projection. 


3.1 Vector Spaces and Subspaces 


We define vector operations in a vector space by collecting all manipulations necessary for solving a 
linear system Ax = b as well as investigating the solutions to its associated linear system Ax = 0 called 
a homogeneous system. When you imagine a vector, you might think of a point in a familiar vector 
space of R”. A vector space is however a much more general concept, including for instance a set of all 


equal-sized matrices and a collection of all real-valued functions that share the same domain. 


3.1.1 Operations in a Vector Space 


There are two basic operations in a vector space; vector addition and scalar multiplication. All other 


operations are derived from these two basic operations. 


1. Scalar multiplication: First, we must think of what a scalar is. In this book, we mostly consider a
real-valued scalar in $\mathbb{R}$, although a scalar can be either real-valued or complex-valued ($\mathbb{C}$). For any
scalar $c$ and vector $\mathbf{v}$, the scalar multiple $c\mathbf{v}$ is also a vector. The scalar multiplication is associative,
i.e. $(c_1c_2)\mathbf{v} = c_1(c_2\mathbf{v})$. The multiplicative identity of scalars, denoted by 1, works as $1\mathbf{v} = \mathbf{v}$. In
addition, for any vector $\mathbf{v}$, we denote $0\mathbf{v} = \mathbf{0}$ and $(-1)\mathbf{v} = -\mathbf{v}$, where the scalars 0 and −1 are the
identity of scalar addition and the additive inverse of 1, respectively.

2. Vector addition: The sum of two vectors $\mathbf{v}_1$ and $\mathbf{v}_2$ is also a vector, i.e., $\mathbf{v}_1 + \mathbf{v}_2$. Vector addition
is both commutative and associative: $\mathbf{v}_1 + \mathbf{v}_2 = \mathbf{v}_2 + \mathbf{v}_1$ and $\mathbf{v}_1 + (\mathbf{v}_2 + \mathbf{v}_3) = (\mathbf{v}_1 + \mathbf{v}_2) + \mathbf{v}_3$. The
additive identity is $\mathbf{0}$ and is often self-evident given a vector space. For instance, some of these
identities include an all-zero vector, an all-zero matrix, and a constant function that outputs only
0. We denote $\mathbf{w} + (-1)\mathbf{v} = \mathbf{w} - \mathbf{v}$ for simplicity.

3. Two distributive interactions between vector addition and scalar multiplication: $c(\mathbf{v}_1 + \mathbf{v}_2) = c\mathbf{v}_1 +
c\mathbf{v}_2$ and $(c_1 + c_2)\mathbf{v} = c_1\mathbf{v} + c_2\mathbf{v}$. From these, we can derive the inverse of an arbitrary vector $\mathbf{v}$ for
the vector addition. Since $\mathbf{v} - \mathbf{v} = 1\mathbf{v} + (-1)\mathbf{v} = (1 - 1)\mathbf{v} = 0\mathbf{v} = \mathbf{0}$, $-\mathbf{v} = (-1)\mathbf{v}$ is the additive
inverse of the vector $\mathbf{v}$.




Definition 3.1 A set $\mathcal{V}$ is a vector space if all vectors in $\mathcal{V}$ and scalars in $\mathbb{R}$ or $\mathbb{C}$ satisfy the
operational rules above.

Throughout the rest of this book, we consider a real-valued scalar ($\mathbb{R}$) unless specified otherwise. Some
of the representative examples of vector spaces include $\mathbb{R}^n$, $\mathbb{R}$, a space of fixed-size matrices and a
space of vector-valued functions. In particular, $\mathbb{R}^n$ is a standard finite-dimensional vector space. We will
discuss dimensionality in more detail later.

If a subset of vectors within a vector space satisfies the rules of a vector space by itself, we call this
subset a subspace of the vector space. In this case, any linear combination of vectors in this subspace
must be part of this subspace, where the linear combination of k vectors $\mathbf{v}_1, \ldots, \mathbf{v}_k$ and k scalars $c_1, \ldots, c_k$
is defined as

\[
c_1\mathbf{v}_1 + \cdots + c_k\mathbf{v}_k = \sum_{i=1}^{k} c_i\mathbf{v}_i .
\]


We define a subspace of a given vector space in terms of the linear combination. 


Definition 3.2 A subspace of a vector space is a non-empty subset of the vector space such that 


all linear combinations stay in the subset. 


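In order to show that a non-empty subset $\mathcal{W}$ of a vector space $\mathcal{V}$ is a subspace, all we need to do is to
check whether $\mathcal{W}$ is closed under vector addition and scalar multiplication. That is, we check whether

• $\mathcal{W} \subset \mathcal{V}$;
• $\mathbf{v}, \mathbf{w} \in \mathcal{W} \;\Rightarrow\; \mathbf{v} + \mathbf{w} \in \mathcal{W}$;
• $c \in \mathbb{R}, \mathbf{v} \in \mathcal{W} \;\Rightarrow\; c\mathbf{v} \in \mathcal{W}$.

A few examples of subspaces of $\mathcal{V}$ include $\{\mathbf{0}\}$ (potentially the smallest non-empty subspace),
$\{c\mathbf{v} : c \in \mathbb{R}\}$ for a $\mathbf{v} \in \mathcal{V}$ (a 1-dimensional subspace if $\mathbf{v} \neq \mathbf{0}$) as well as
$\{c_1\mathbf{v}_1 + \cdots + c_n\mathbf{v}_n : c_1, \ldots, c_n \in \mathbb{R}\}$ for fixed
$\mathbf{v}_1, \ldots, \mathbf{v}_n \in \mathcal{V}$. A typical example of a non-subspace is $\{(x, y) : x > 0, y > 0\}$ in $\mathbb{R}^2$. In the case of a
vector space consisting of matrices, some of the example subspaces include a set of all lower-triangular
matrices and a set of all symmetric matrices.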

Interestingly, we can write scalar multiplication as either $c\mathbf{v}$ or $\mathbf{v}c$, when $\mathbf{v} \in \mathbb{R}^n$. The former, $c\mathbf{v}$, is
a standard way to express scalar multiplication of a vector. On the other hand, we can think of $\mathbf{v}c$ as
performing matrix multiplication $\mathbf{v}[c]$, where $\mathbf{v}$ is an $n \times 1$ matrix and $[c]$ a $1 \times 1$ matrix. The latter, together with
the associativity of matrix multiplication, may help you identify useful expressions with more than two
multiplicands, such as the one below:

\[
(\mathbf{u}^\top \mathbf{w})\mathbf{v} = \mathbf{v}(\mathbf{u}^\top \mathbf{w}) = (\mathbf{v}\mathbf{u}^\top)\mathbf{w},
\]

where $\mathbf{u}, \mathbf{w} \in \mathbb{R}^n$. Unlike the first two terms, which are both scalar multiplication, the right-most term is
matrix-vector multiplication.


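Let us introduce the following notation for summing a set and a vector:

Definition 3.3 For any pair of subsets, $A$ and $B$, of a vector space $\mathcal{V}$, we define the sum of $A$
and $B$ asᵃ

\[
A + B = \{\mathbf{u} + \mathbf{v} : \mathbf{u} \in A, \mathbf{v} \in B\} .
\]

If both $\mathcal{U}$ and $\mathcal{W}$ are subspaces of $\mathcal{V}$, and $\mathcal{U} \cap \mathcal{W} = \{\mathbf{0}\}$, we use $\mathcal{U} \oplus \mathcal{W}$ in place of $\mathcal{U} + \mathcal{W}$ and call
it the direct sum.

ᵃ For brevity, we often shorten $\{\mathbf{v}\} + A = \mathbf{v} + A$ for $\mathbf{v} \in \mathcal{V}$ and $A \subset \mathcal{V}$.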


From this definition, we can derive one important property of the direct sum. If a vector in U6 W 
can be expressed as uy + Vy = U2 + V2 with u1, u2 € U and vj, v2 € W, then u — ug = v2 — vı EC UN W. 
Because U N W = {0} according to Definition 3.3, it holds that u, = uy and vı = v2. In other words, 


there is a unique way to express each vector in U 6 W in terms of vectors from U and W. 


Fact 3.1 A vector in a direct sum has a unique representation: For v € USW, there exists unique u € U 


and w € W such that v = u + w. 


A two dimensional Euclidean space with its two axes is a typical example for demonstrating the 
relationship among the summand subspaces A, B and their direct sum A @ B. The two dimensional 
Euclidean space can be expressed as a direct sum of two subspaces induced from the two axes, R x {0} 
and {0} x R. That is, R? = (R x {0}) 6 ({0} x R), where the symbol x is called a Cartesian product and 
defined as, for any two sets A and B, 


Ax B= {(a,b):a€ A,bE B} 


with a and b that could be real numbers, vectors, or even functions.! This becomes helpful later when 


we encounter a vector space expressed as a direct sum of subspaces. 


3.1.2 Two Fundamental Subspaces induced by Matrices 


Let A = [aj|---| an], a; E€ R” be an m x n matrix. We can readily come up with two subspaces from 


this matrix; an m-dimensional column space and an n-dimensional null space. 


e The column space of A, Col (A) : the collection of linear combinations of columns of A. 
Col (A) = {via + +++ + Vnan : V1,---,Un E R} = {Av: vE R"} cR”. 


We enumerate a few (simple) properties of the column space. 


¹ Examples of the Cartesian product include $\mathbb{R} \times \mathbb{R} = \mathbb{R}^2$ and $\mathbb{R}^m \times \mathbb{R}^n = \mathbb{R}^{m+n}$.




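1. $A\mathbf{x} = \mathbf{b}$ is solvable if and only if $\mathbf{b} \in \mathrm{Col}(A)$;
2. $\mathrm{Col}(I_n) = \mathbb{R}^n$;
3. If $A$ is an $n \times n$ invertible matrix, then $\mathrm{Col}(A) = \mathbb{R}^n$;
4. $\mathrm{Col}(A)$ is a subspace of $\mathbb{R}^m$.

• The null space of $A$, $\mathrm{Null}(A)$: the collection of vectors being mapped to $\mathbf{0}$ via the matrix $A$.

$$\mathrm{Null}(A) = \{\mathbf{v} \in \mathbb{R}^n : A\mathbf{v} = \mathbf{0}\}.$$

We often call $\mathrm{Null}(A)$ the kernel of $A$. Here are a few (simple) properties of the null space.

1. $\mathrm{Null}(A) = \mathbb{R}^n$ if and only if $A = 0$;
2. $\mathrm{Null}(I_n) = \{\mathbf{0}\}$;
3. If $A$ is an $n \times n$ invertible matrix, then $\mathrm{Null}(A) = \{\mathbf{0}\}$;
4. $\mathrm{Null}(A)$ is a subspace of $\mathbb{R}^n$.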


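When we multiply a matrix from the right by another matrix, the column space of the product is contained in that of the former and may shrink, unless the matrix being multiplied from the right is invertible. In that case, the column space does not change.

Lemma 3.1 For any pair of matrices, $A$ and $B$, where the number of columns of $A$ and the number of rows of $B$ coincide, $\mathrm{Col}(AB) \subset \mathrm{Col}(A)$. If $B$ is invertible, then $\mathrm{Col}(AB) = \mathrm{Col}(A)$.

Proof: For any $\mathbf{v}$, $AB\mathbf{v} = A(B\mathbf{v}) \in \mathrm{Col}(A)$. Therefore, $\mathrm{Col}(AB) \subset \mathrm{Col}(A)$. Assume $B$ is invertible. Set $C = AB$. Then, $A = CB^{-1}$ and $\mathrm{Col}(A) = \mathrm{Col}(CB^{-1}) \subset \mathrm{Col}(C) = \mathrm{Col}(AB)$ by the first part of the lemma. Combining the two inclusions gives $\mathrm{Col}(AB) = \mathrm{Col}(A)$. ∎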


The null space of a matrix determines the structure of the set of solutions to a linear system defined 


by the matrix. We obtain the solution set of any such linear system by shifting the null space. 
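Fact 3.2 Assume $A\mathbf{x}^* = \mathbf{b}$. Then, $\{\mathbf{x} : A\mathbf{x} = \mathbf{b}\} = \{\mathbf{x}^* + \mathbf{y} : \mathbf{y} \in \mathrm{Null}(A)\} = \mathbf{x}^* + \mathrm{Null}(A)$.

Proof: Let $A\mathbf{x} = \mathbf{b}$. Then, $A(\mathbf{x} - \mathbf{x}^*) = A\mathbf{x} - A\mathbf{x}^* = \mathbf{b} - \mathbf{b} = \mathbf{0}$ and $\mathbf{x} - \mathbf{x}^* = \mathbf{y} \in \mathrm{Null}(A)$, which proves one direction of equality. For the other direction, $A(\mathbf{x}^* + \mathbf{y}) = A\mathbf{x}^* + A\mathbf{y} = \mathbf{b} + \mathbf{0} = \mathbf{b}$ for $\mathbf{y} \in \mathrm{Null}(A)$. ∎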
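Fact 3.2 can be checked numerically on a small case. The following is a minimal sketch in plain Python; the 3 × 4 matrix, the particular solution `x_star`, the two null-space vectors and the helper names `matvec` and `shifted` are illustrative choices for this sketch, not part of the statement. It only tests the direction "every vector in $\mathbf{x}^* + \mathrm{Null}(A)$ solves the system".

```python
def matvec(A, x):
    """Multiply a matrix (list of rows) by a vector (list of numbers)."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

# Illustrative data (assumed for this sketch).
A = [[1, 3, 3, 2],
     [2, 6, 9, 7],
     [-1, -3, 3, 4]]
x_star = [1, 0, 1, 0]      # one particular solution
b = matvec(A, x_star)      # define b so that A x* = b holds by construction

# Two vectors in Null(A): A y = 0 for each of them.
null_vectors = [[-3, 1, 0, 0], [1, 0, -1, 1]]
assert all(matvec(A, y) == [0, 0, 0] for y in null_vectors)

def shifted(alpha, beta):
    """Return x* + alpha*y1 + beta*y2, an element of x* + Null(A)."""
    y1, y2 = null_vectors
    return [xs + alpha * p + beta * q for xs, p, q in zip(x_star, y1, y2)]

# Every shifted vector solves the same system A x = b.
candidates = [shifted(alpha, beta) for alpha in (-2, 0, 5) for beta in (-1, 0, 3)]
all_solve = all(matvec(A, x) == b for x in candidates)
print(b, all_solve)
```

Integer arithmetic keeps every comparison exact, so no floating-point tolerance is needed here.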


3.2 Solving Ax = 0 and Ax = b


Let us revisit the procedure to solve a linear system in Section 2.6. If we interpret $A\mathbf{x}$ as a linear combination of the column vectors of $A$, we can regard solving $A\mathbf{x} = \mathbf{b}$ as finding a linear combination that matches $\mathbf{b}$. Therefore, more independent columns in the coefficient matrix $A$ in $A\mathbf{x} = \mathbf{b}$ imply more $\mathbf{b}$'s for which there exist solutions.² An invertible matrix can be thought of as a matrix with the maximal number of independent columns. In this case, the linear system has a unique solution for any $\mathbf{b}$, i.e., $\mathbf{x} = A^{-1}\mathbf{b}$. On the other hand, if there exists $\mathbf{0} \neq \mathbf{y} \in \mathrm{Null}(A)$ (i.e., there is a relation between columns of $A$ such as $A\mathbf{y} = y_1\mathbf{a}_1 + \cdots + y_n\mathbf{a}_n = \mathbf{0}$), there may be no solution or infinitely many solutions to the linear system depending on the choice of $\mathbf{b}$, according to Fact 3.2. In general, if the column space is a strict subspace of $\mathbb{R}^m$ (i.e., $\mathrm{Col}(A) \subsetneq \mathbb{R}^m$), the linear system has a solution only when $\mathbf{b} \in \mathrm{Col}(A)$.

² We will define and discuss the notion of independent vectors later in Definition 3.4.

In order to solve $A\mathbf{x} = \mathbf{b}$, we repeatedly eliminate coefficients of the linear system by adding/subtracting a multiple of one equation to/from another equation, resulting in increasingly more zero entries in the coefficient matrix $A$. Gaussian elimination does so systematically, starting from the top-left coefficient and zeroing out the entries below it, column by column. In the resulting coefficient matrix, the entries below the first non-zero entry of each row are all zeros, and we call a matrix in such a form a row echelon form.


3.2.1 A Row Echelon Form U 


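Let us perform Gaussian elimination on

$$A = \begin{bmatrix} 1 & 3 & 3 & 2 \\ 2 & 6 & 9 & 7 \\ -1 & -3 & 3 & 4 \end{bmatrix}.$$

• First pivoting: We use $a_{11}$ to eliminate all elements of the first column of $A$ except for $a_{11}$.

$$\tilde{L}_1 A = \begin{bmatrix} 1 & 0 & 0 \\ -2 & 1 & 0 \\ 1 & 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 3 & 3 & 2 \\ 2 & 6 & 9 & 7 \\ -1 & -3 & 3 & 4 \end{bmatrix} = \begin{bmatrix} 1 & 3 & 3 & 2 \\ 0 & 0 & 3 & 3 \\ 0 & 0 & 6 & 6 \end{bmatrix}$$

• Second pivoting: We use $(\tilde{L}_1 A)_{23}$ to eliminate the element below it in the third column.

$$\tilde{L}_2 \tilde{L}_1 A = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & -2 & 1 \end{bmatrix} \begin{bmatrix} 1 & 3 & 3 & 2 \\ 0 & 0 & 3 & 3 \\ 0 & 0 & 6 & 6 \end{bmatrix} = \begin{bmatrix} 1 & 3 & 3 & 2 \\ 0 & 0 & 3 & 3 \\ 0 & 0 & 0 & 0 \end{bmatrix} = U$$

The matrices multiplied from the left are called elementary matrices, and they are lower triangular. We use pivots to refer to the elements that were used to eliminate the others in the corresponding columns. In the above example, $(A)_{11}$ and $(\tilde{L}_1 A)_{23}$ are pivot elements.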


Some properties of the pivot elements are 

1. Pivots are the first non-zero entries in their rows; 

2. A pivot element does not need to be 1; 

3. Below each pivot is a column of zeros after elimination; 


4. Each pivot lies to the right of the pivot in the row above. 




After Gaussian elimination, we end up with an upper triangular matrix with all-zero rows at the 
bottom of the matrix. We call this form of a matrix a row echelon form. 


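We get the following matrix by multiplying all the elementary matrices above in order:

$$\tilde{L} = \tilde{L}_2 \tilde{L}_1 = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & -2 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 & 0 \\ -2 & 1 & 0 \\ 1 & 0 & 1 \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 \\ -2 & 1 & 0 \\ 5 & -2 & 1 \end{bmatrix}.$$

According to Theorem 2.1, this matrix is invertible, and its inverse is

$$L = \tilde{L}^{-1} = \tilde{L}_1^{-1} \tilde{L}_2^{-1} = \begin{bmatrix} 1 & 0 & 0 \\ 2 & 1 & 0 \\ -1 & 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 2 & 1 \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 \\ 2 & 1 & 0 \\ -1 & 2 & 1 \end{bmatrix}.$$

Because $\tilde{L}A = U$ and thus $L^{-1}A = U$, we can write $A$ as

$$A = LU = \begin{bmatrix} 1 & 0 & 0 \\ 2 & 1 & 0 \\ -1 & 2 & 1 \end{bmatrix} \begin{bmatrix} 1 & 3 & 3 & 2 \\ 0 & 0 & 3 & 3 \\ 0 & 0 & 0 & 0 \end{bmatrix} = \text{lower triangular} \times \text{row echelon form}.$$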
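The elimination steps above can be sketched in code. This is a minimal illustration using exact fractions, under the assumption that no row swap is ever needed; the function names `lu_without_row_swaps` and `matmul` are hypothetical helpers written for this sketch. Each multiplier used to clear an entry is stored in $L$, which is why $L$ comes out unit lower triangular.

```python
from fractions import Fraction

def lu_without_row_swaps(A):
    """Return (L, U) with A = L U, L unit lower triangular, U in row echelon form.

    Assumption of this sketch: no row swap is needed; otherwise a
    ValueError is raised instead of applying a permutation.
    """
    m, n = len(A), len(A[0])
    U = [[Fraction(x) for x in row] for row in A]
    L = [[Fraction(int(i == j)) for j in range(m)] for i in range(m)]
    pivot_row = 0
    for col in range(n):
        if pivot_row == m:
            break
        if U[pivot_row][col] == 0:
            if any(U[i][col] != 0 for i in range(pivot_row + 1, m)):
                raise ValueError("a row swap would be needed")
            continue  # no pivot in this column
        for i in range(pivot_row + 1, m):
            factor = U[i][col] / U[pivot_row][col]
            L[i][pivot_row] = factor  # multiplier that undoes this elimination step
            U[i] = [a - factor * p for a, p in zip(U[i], U[pivot_row])]
        pivot_row += 1
    return L, U

def matmul(X, Y):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

A = [[1, 3, 3, 2],
     [2, 6, 9, 7],
     [-1, -3, 3, 4]]
L, U = lu_without_row_swaps(A)
print(L)
print(U)
```

For the matrix of this section the sketch reproduces the factors derived by hand, and `matmul(L, U)` recovers `A`.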


If it is necessary to swap rows during Gaussian elimination, we can apply a permutation and still end up with a row echelon form of the permuted matrix. This leads to the following result:


Fact 3.3 For any m x n matrix A, there exist a permutation matrix Q, a lower triangular square matrix 


L with a unit diagonal, and an m x n echelon matrix U which is upper triangular, such that QA = LU. 


A Reduced Row Echelon Form 


In a row echelon form, a pivot may not be 1, and elements above a pivot element may not be 0. We can impose these conditions by first multiplying a matrix in row echelon form from the left by an appropriate diagonal matrix to scale all pivots to 1, and next multiplying the resulting matrix from the left again by an appropriate upper triangular matrix to eliminate all non-zero elements above the pivot elements. We say that the resulting matrix is in a reduced row echelon form.³


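Here, we try to obtain a reduced row echelon form of

$$L^{-1}QA = \begin{bmatrix} 1 & 3 & 3 & 2 \\ 0 & 0 & 3 & 3 \\ 0 & 0 & 0 & 0 \end{bmatrix}$$

above by additional scaling and elimination.

• Scaling the second row: we scale the pivot 3 to 1 by multiplying it with a diagonal matrix from the left.

$$\tilde{D}L^{-1}QA = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1/3 & 0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 3 & 3 & 2 \\ 0 & 0 & 3 & 3 \\ 0 & 0 & 0 & 0 \end{bmatrix} = \begin{bmatrix} 1 & 3 & 3 & 2 \\ 0 & 0 & 1 & 1 \\ 0 & 0 & 0 & 0 \end{bmatrix}$$

• Eliminating non-zero elements above the pivot elements:

$$\tilde{U}\tilde{D}L^{-1}QA = \begin{bmatrix} 1 & -3 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 3 & 3 & 2 \\ 0 & 0 & 1 & 1 \\ 0 & 0 & 0 & 0 \end{bmatrix} = \begin{bmatrix} 1 & 3 & 0 & -1 \\ 0 & 0 & 1 & 1 \\ 0 & 0 & 0 & 0 \end{bmatrix} = R$$

We can rearrange all the steps taken so far, i.e. $\tilde{U}\tilde{D}L^{-1}QA = R$, as

$$QA = L\tilde{D}^{-1}\tilde{U}^{-1}R = LDUR, \qquad (3.1)$$

where $D$ and $U$ on the right-hand side abbreviate $\tilde{D}^{-1}$ and $\tilde{U}^{-1}$, and every matrix multiplied to $A$ (or modified $A$), including $Q$, $L$, $D$ and $U$, is invertible. Recall that the product of two upper triangular matrices, $UR$, is also an upper triangular matrix.

³ A reduced row echelon form is unique up to permutation after Gaussian elimination.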
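The scaling and elimination steps can be combined into one routine. The following is a minimal sketch with exact fractions; the function name `rref` and its return convention (the reduced matrix together with the list of pivot column indices) are choices made for this illustration. Unlike the three-stage derivation above, it scales each pivot and clears its whole column in a single pass, and it swaps rows whenever a pivot position holds a zero.

```python
from fractions import Fraction

def rref(rows):
    """Return (R, pivot_cols): a reduced row echelon form and its pivot columns."""
    M = [[Fraction(x) for x in row] for row in rows]
    m, n = len(M), len(M[0])
    pivot_cols = []
    r = 0  # index of the row that receives the next pivot
    for c in range(n):
        if r == m:
            break
        # Look for a usable pivot in column c, at or below row r.
        p = next((i for i in range(r, m) if M[i][c] != 0), None)
        if p is None:
            continue  # column c has no pivot
        M[r], M[p] = M[p], M[r]
        pivot = M[r][c]
        M[r] = [x / pivot for x in M[r]]  # scale the pivot to 1
        for i in range(m):
            if i != r and M[i][c] != 0:
                factor = M[i][c]
                M[i] = [a - factor * b for a, b in zip(M[i], M[r])]
        pivot_cols.append(c)
        r += 1
    return M, pivot_cols

A = [[1, 3, 3, 2],
     [2, 6, 9, 7],
     [-1, -3, 3, 4]]
R, pivot_cols = rref(A)
print(R, pivot_cols)
```

On the running example this gives the same $R$ as the hand computation, with pivots in the first and third columns (indices 0 and 2).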


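We can illustrate the matrix entries of a row echelon form and a reduced row echelon form as:

$$U = \begin{bmatrix} \bullet & * & * & * & * & * & * & * \\ 0 & \bullet & * & * & * & * & * & * \\ 0 & 0 & 0 & \bullet & * & * & * & * \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & \bullet \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \end{bmatrix} \xrightarrow{\text{scaling and reduction}} R = \begin{bmatrix} 1 & 0 & * & 0 & * & * & * & 0 \\ 0 & 1 & * & 0 & * & * & * & 0 \\ 0 & 0 & 0 & 1 & * & * & * & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \end{bmatrix}$$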


3.2.2 Pivot Variables and Free Variables 


We call variables (elements) in $\mathbf{x}$ that correspond to the columns in $R$ (in a reduced row echelon form) that contain pivot elements pivot variables, and the rest of the variables in $\mathbf{x}$ free variables. Among many different ways to find a solution to $R\mathbf{x} = \mathbf{0}$, one systematic way is to assign (literally) arbitrary values to free variables and determine the values of pivot variables.


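Let us continue from the previous example

$$R\mathbf{x} = \begin{bmatrix} 1 & 3 & 0 & -1 \\ 0 & 0 & 1 & 1 \\ 0 & 0 & 0 & 0 \end{bmatrix} \begin{bmatrix} u \\ v \\ w \\ y \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix},$$

where $u$ and $w$ are pivot variables, and $v$ and $y$ are free variables. We can express the pivot variables as functions of the free variables, as follows

$$u = -3v + y$$

$$w = -y$$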


We can therefore readily determine the values of the pivot variables once we assign arbitrary values to 
the free variables. Although we will only define the notion of dimension rigorously in Definition 3.7, we 
can imagine that the number of free variables is the dimension of the solution space of Rx = 0. 


We can derive a few interesting properties.

• $A\mathbf{x} = \mathbf{0} \iff R\mathbf{x} = \mathbf{0}$: Because $Q$, $L$, $D$ and $U$ are all invertible in (3.1), these two systems are equivalent.

• We can express a vector in $\mathrm{Null}(A)$ by replacing each pivot variable with its equivalent expression in terms of free variables, as follows

$$\mathbf{x} = \begin{bmatrix} -3v + y \\ v \\ -y \\ y \end{bmatrix} = v \begin{bmatrix} -3 \\ 1 \\ 0 \\ 0 \end{bmatrix} + y \begin{bmatrix} 1 \\ 0 \\ -1 \\ 1 \end{bmatrix}.$$

Geometrically, this can be thought of as a 2-dimensional plane in the 4-dimensional Euclidean space, $\mathbb{R}^4$. We can also express it as the column space $\mathrm{Col}\left(\begin{bmatrix} -3 & 1 \\ 1 & 0 \\ 0 & -1 \\ 0 & 1 \end{bmatrix}\right)$, where the first column is a solution to $A\mathbf{x} = \mathbf{0}$ given $v = 1$ and $y = 0$, and the second column given $v = 0$ and $y = 1$.
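The recipe "set one free variable to 1, the others to 0, and read off the pivot variables" can be written as a short routine. This is a minimal sketch; the function name `special_solutions` and its arguments are hypothetical, and it assumes that `R` is already in reduced row echelon form with `pivot_cols[r]` giving the pivot column of row `r`.

```python
from fractions import Fraction

def special_solutions(R, pivot_cols):
    """Basis of Null(R), one vector per free variable.

    Assumes R is in reduced row echelon form and pivot_cols[r] is the
    column index of the pivot in row r.
    """
    n = len(R[0])
    free_cols = [c for c in range(n) if c not in pivot_cols]
    basis = []
    for f in free_cols:
        x = [Fraction(0)] * n
        x[f] = Fraction(1)  # this free variable is 1, the other free variables stay 0
        for r, p in enumerate(pivot_cols):
            x[p] = -Fraction(R[r][f])  # pivot variable read off from row r of R x = 0
        basis.append(x)
    return basis

def matvec(A, x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

R = [[1, 3, 0, -1],
     [0, 0, 1, 1],
     [0, 0, 0, 0]]
A = [[1, 3, 3, 2],
     [2, 6, 9, 7],
     [-1, -3, 3, 4]]
basis = special_solutions(R, pivot_cols=[0, 2])
print(basis)
```

The two vectors returned are the columns of the matrix whose column space describes $\mathrm{Null}(A)$ above, and each is mapped to zero by the original $A$ as well as by $R$.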
Let $A$ be an $m \times n$ matrix. Then,
• the number of pivots $\le \min(m, n)$;
• if $n > m$, there exists at least one free variable, and $A\mathbf{x} = \mathbf{0}$ has at least one non-zero solution.


Because it will be useful in later sections, we summarize the last property above into the following 


lemma. 


Lemma 3.2 If a matrix $A$ has more columns than rows, $A\mathbf{x} = \mathbf{0}$ has a non-zero solution. Equivalently, if $A\mathbf{x} = \mathbf{0}$ does not have any non-zero solution, then $A$ has at least as many rows as there are columns.


Solving Ax = b, Ux = c and Rx=d 


We are now ready to take the final step of solving a linear system. We assume that we can obtain a 
row echelon form without swapping rows, that is, there is no need to multiply a permutation matrix to 
proceed with Gaussian elimination. 

First, we multiply both sides of $A\mathbf{x} = \mathbf{b}$ with a lower triangular matrix $L^{-1}$ to obtain a linear system expressed in terms of a row-echelon-form coefficient matrix, i.e.

$$L^{-1}(A\mathbf{x}) = L^{-1}\mathbf{b} \iff U\mathbf{x} = \mathbf{c}.$$


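Consider the example above

$$A\mathbf{x} = \begin{bmatrix} 1 & 3 & 3 & 2 \\ 2 & 6 & 9 & 7 \\ -1 & -3 & 3 & 4 \end{bmatrix} \mathbf{x} = \begin{bmatrix} b_1 \\ b_2 \\ b_3 \end{bmatrix} = \mathbf{b}.$$

This corresponds to

$$L^{-1}A\mathbf{x} = \begin{bmatrix} 1 & 0 & 0 \\ -2 & 1 & 0 \\ 5 & -2 & 1 \end{bmatrix} \begin{bmatrix} 1 & 3 & 3 & 2 \\ 2 & 6 & 9 & 7 \\ -1 & -3 & 3 & 4 \end{bmatrix} \mathbf{x} = \begin{bmatrix} 1 & 0 & 0 \\ -2 & 1 & 0 \\ 5 & -2 & 1 \end{bmatrix} \begin{bmatrix} b_1 \\ b_2 \\ b_3 \end{bmatrix} = L^{-1}\mathbf{b},$$

resulting in

$$U\mathbf{x} = \begin{bmatrix} 1 & 3 & 3 & 2 \\ 0 & 0 & 3 & 3 \\ 0 & 0 & 0 & 0 \end{bmatrix} \mathbf{x} = \begin{bmatrix} b_1 \\ b_2 - 2b_1 \\ b_3 - 2b_2 + 5b_1 \end{bmatrix} = \mathbf{c}.$$

We therefore see that there exists a solution to $U\mathbf{x} = \mathbf{c}$ (equivalently, $A\mathbf{x} = \mathbf{b}$) if and only if $b_3 - 2b_2 + 5b_1 = 0$.

With $b_3 - 2b_2 + 5b_1 = 0$, we can rewrite the linear system as

$$\begin{bmatrix} 1 & 3 & 3 & 2 \\ 0 & 0 & 3 & 3 \\ 0 & 0 & 0 & 0 \end{bmatrix} \mathbf{x} = \begin{bmatrix} b_1 \\ b_2 - 2b_1 \\ 0 \end{bmatrix}.$$

By setting all free variables to 0 ($v = 0$ and $y = 0$), we get a particular solution $\mathbf{x}_p = (3b_1 - b_2,\ 0,\ \tfrac{1}{3}b_2 - \tfrac{2}{3}b_1,\ 0)^\top$, because $3w = b_2 - 2b_1$ gives $w = \tfrac{1}{3}b_2 - \tfrac{2}{3}b_1$, and $u + 3w = b_1$ gives $u = 3b_1 - b_2$.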
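Now we solve the following system, which is equivalent to the homogeneous system, $A\mathbf{x} = \mathbf{0}$, of the original system:

$$\begin{bmatrix} 1 & 3 & 3 & 2 \\ 0 & 0 & 3 & 3 \\ 0 & 0 & 0 & 0 \end{bmatrix} \begin{bmatrix} u \\ v \\ w \\ y \end{bmatrix} = \mathbf{0}.$$

As we have seen earlier, we get the following two independent solutions:

• $v = 1, y = 0$: $\mathbf{x}_1 = (-3, 1, 0, 0)^\top$

• $v = 0, y = 1$: $\mathbf{x}_2 = (1, 0, -1, 1)^\top$

We can then express solutions with arbitrary $\alpha$ and $\beta$ as

$$\mathbf{x} = \mathbf{x}_p + \alpha\mathbf{x}_1 + \beta\mathbf{x}_2 = \begin{bmatrix} 3b_1 - b_2 \\ 0 \\ \tfrac{1}{3}b_2 - \tfrac{2}{3}b_1 \\ 0 \end{bmatrix} + \alpha \begin{bmatrix} -3 \\ 1 \\ 0 \\ 0 \end{bmatrix} + \beta \begin{bmatrix} 1 \\ 0 \\ -1 \\ 1 \end{bmatrix}.$$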


We encourage you to show that Ax = b for this x yourself. 
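One way to carry out this check is numerically, for a concrete right-hand side. The sketch below follows the two stages of this section: forward substitution computes $\mathbf{c} = L^{-1}\mathbf{b}$, and back substitution with all free variables set to 0 gives a particular solution. The right-hand side `b = [1, 5, 5]`, the coefficients `alpha` and `beta`, and the helper names are arbitrary choices for this illustration.

```python
from fractions import Fraction

def forward_substitution(L, b):
    """Solve L c = b for a unit lower triangular L, i.e. compute c = L^{-1} b."""
    c = []
    for i, row in enumerate(L):
        c.append(Fraction(b[i]) - sum(row[j] * c[j] for j in range(i)))
    return c

def particular_solution(U, c):
    """Solve U x = c with every free variable set to 0; return None if inconsistent."""
    n = len(U[0])
    x = [Fraction(0)] * n
    pivots = []  # (row, column) of each pivot of the row echelon form U
    for i, row in enumerate(U):
        col = next((j for j, value in enumerate(row) if value != 0), None)
        if col is None:
            if c[i] != 0:
                return None  # a zero row of U paired with a non-zero entry of c
        else:
            pivots.append((i, col))
    for i, col in reversed(pivots):  # back substitution, last pivot first
        known = sum(U[i][j] * x[j] for j in range(col + 1, n))
        x[col] = (c[i] - known) / U[i][col]
    return x

def matvec(A, x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

A = [[1, 3, 3, 2], [2, 6, 9, 7], [-1, -3, 3, 4]]
L = [[1, 0, 0], [2, 1, 0], [-1, 2, 1]]
U = [[1, 3, 3, 2], [0, 0, 3, 3], [0, 0, 0, 0]]
b = [1, 5, 5]  # satisfies b3 - 2*b2 + 5*b1 = 0

c = forward_substitution(L, b)
x_p = particular_solution(U, c)
x1, x2 = [-3, 1, 0, 0], [1, 0, -1, 1]
alpha, beta = 2, -3
x = [p + alpha * s + beta * t for p, s, t in zip(x_p, x1, x2)]
print(c, x_p, matvec(A, x) == b)
```

Changing `b` to a vector that violates $b_3 - 2b_2 + 5b_1 = 0$, for instance `[1, 5, 6]`, makes `particular_solution` return `None`.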

To summarize, assume U has r pivots when U is a row echelon form of A. That is, it satisfies Ux =c 
which is equivalent to the original system Ax = b. Because the last (m — r) rows of U are all zeros, the 
last (m — r) elements of c must be all zeros as well, for the linear system to have a solution. If so, (n — r) 
elements in x are free variables. 

U may look different as we swap rows during Gaussian elimination. The number of pivots in a matrix 


is however maintained and is called the matrix’s rank. See Section 3.4 for more details. 


3.3 Linear Independence, Basis, and Dimension 


We introduce concepts of linear independence, spanning a subspace, a basis for a subspace, and the 


dimension of a subspace, which are fundamental to linear algebra. 


Linear Independence 
When we refer to a set of vectors as linearly independent vectors, we are saying that it is a minimal set 


of non-redundant vectors. More formally, 


Definition 3.4 For vectors $\mathbf{v}_i \in \mathbb{V}$ and scalars $c_i$, suppose $c_1\mathbf{v}_1 + \cdots + c_n\mathbf{v}_n = \mathbf{0}$ holds only when $c_1 = \cdots = c_n = 0$. Then, $\{\mathbf{v}_1, \ldots, \mathbf{v}_n\}$ is linearly independent. If the linear combination is zero for some non-zero $c_i$'s, $\{\mathbf{v}_1, \ldots, \mathbf{v}_n\}$ is linearly dependent, and some $\mathbf{v}_i$ can be represented as a linear combination of the others.


Based on this definition, it is important for you to understand the following properties and examples. 


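• $\{\mathbf{0}\}$ is linearly dependent.

• $A = \begin{bmatrix} 1 & 3 & 3 & 2 \\ 2 & 6 & 9 & 5 \\ -1 & -3 & 3 & 0 \end{bmatrix} = [\mathbf{a}_1 | \mathbf{a}_2 | \mathbf{a}_3 | \mathbf{a}_4]$: Columns of $A$ are linearly dependent, because $(-3)\mathbf{a}_1 + 1\mathbf{a}_2 + 0\mathbf{a}_3 + 0\mathbf{a}_4 = \mathbf{0}$. Rows of $A$ are also linearly dependent, because if we denote $B = [\mathbf{b}_1 | \mathbf{b}_2 | \mathbf{b}_3] = A^\top$, $5\mathbf{b}_1 + (-2)\mathbf{b}_2 + 1\mathbf{b}_3 = \mathbf{0}$.

• $A = \begin{bmatrix} 3 & 4 & 2 \\ 0 & 1 & 5 \\ 0 & 0 & 2 \end{bmatrix}$: Columns of $A$ are linearly independent.

• $A = [\mathbf{a}_1 | \mathbf{a}_2 | \cdots | \mathbf{a}_n]$ and $\mathbf{x} = (x_1, x_2, \ldots, x_n)^\top \in \mathbb{R}^n$:

$\{\mathbf{a}_1, \mathbf{a}_2, \ldots, \mathbf{a}_n\}$ is linearly independent

$\iff x_1\mathbf{a}_1 + x_2\mathbf{a}_2 + \cdots + x_n\mathbf{a}_n = \mathbf{0}$ implies $\mathbf{x} = \mathbf{0}$

$\iff A\mathbf{x} = \mathbf{0}$ implies $\mathbf{x} = \mathbf{0}$

$\iff \mathrm{Null}(A) = \{\mathbf{0}\}$

• Non-zero rows of a row-echelon-form $U$ are linearly independent. Similarly, columns containing pivots are linearly independent.

• A set of $n$ vectors in $\mathbb{R}^m$ is always linearly dependent if $n > m$ because of Lemma 3.2.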
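The equivalence "linearly independent $\iff \mathrm{Null}(A) = \{\mathbf{0}\}$" suggests a mechanical test: put the vectors into the columns of a matrix, eliminate, and check that every column received a pivot (no free variable). The following is a minimal sketch of that test with exact fractions; `count_pivots` and `is_linearly_independent` are names invented for this illustration.

```python
from fractions import Fraction

def count_pivots(rows):
    """Number of pivots found by Gaussian elimination (with row swaps if needed)."""
    M = [[Fraction(x) for x in row] for row in rows]
    m, n = len(M), len(M[0])
    r = 0
    for c in range(n):
        if r == m:
            break
        p = next((i for i in range(r, m) if M[i][c] != 0), None)
        if p is None:
            continue
        M[r], M[p] = M[p], M[r]
        for i in range(r + 1, m):
            factor = M[i][c] / M[r][c]
            M[i] = [a - factor * b for a, b in zip(M[i], M[r])]
        r += 1
    return r

def is_linearly_independent(vectors):
    """vectors: list of equal-length lists; they become the columns of A."""
    A = [list(row) for row in zip(*vectors)]  # stack the vectors as columns
    return count_pivots(A) == len(vectors)    # no free variable, so Null(A) = {0}

columns_3x4 = [[1, 2, -1], [3, 6, -3], [3, 9, 3], [2, 5, 0]]  # 4 vectors in R^3
rows_3x4 = [[1, 3, 3, 2], [2, 6, 9, 5], [-1, -3, 3, 0]]       # rows of the same matrix
triangular = [[3, 0, 0], [4, 1, 0], [2, 5, 2]]                # columns of the 3 x 3 example

print(is_linearly_independent(columns_3x4),
      is_linearly_independent(rows_3x4),
      is_linearly_independent(triangular),
      is_linearly_independent([[0, 0]]))
```

The first two sets are reported as dependent, the triangular columns as independent, and the set containing only the zero vector as dependent, in line with the examples above.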


Spanning a Subspace 
When a number of vectors can express each and every vector in a vector space as their linear combination, 


we say these vectors span the vector space. 


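Definition 3.5 For vectors $\mathbf{v}_1, \ldots, \mathbf{v}_n$, their span is the minimal subspace containing $\mathbf{v}_1, \ldots, \mathbf{v}_n$, and is described formally as

$$\mathrm{span}\{\mathbf{v}_1, \ldots, \mathbf{v}_n\} = \{c_1\mathbf{v}_1 + \cdots + c_n\mathbf{v}_n : c_i \in \mathbb{R}\}.$$

If $\mathbb{V} = \mathrm{span}\{\mathbf{v}_1, \ldots, \mathbf{v}_n\}$, then we say that $\{\mathbf{v}_1, \ldots, \mathbf{v}_n\}$ spans $\mathbb{V}$.

Here are two oft-used vector spaces spanned by a finite set of vectors:
• For $A = [\mathbf{a}_1 | \mathbf{a}_2 | \cdots | \mathbf{a}_n]$, $\mathrm{Col}(A) = \mathrm{span}\{\mathbf{a}_1, \mathbf{a}_2, \ldots, \mathbf{a}_n\}$.
• For $\mathbf{e}_i = (0, \ldots, 0, 1, 0, \ldots, 0)^\top$ where the $i$-th entry is 1, $\mathbb{R}^n = \mathrm{span}\{\mathbf{e}_1, \ldots, \mathbf{e}_n\}$.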


Spanning vectors need not be linearly independent. When they are not linearly independent, there are many different ways to linearly combine spanning vectors to represent each vector in the spanned space. In contrast, linearly independent vectors span a vector space with a unique linear combination for each vector within. Fact 3.4 below clarifies this point.


Fact 3.4 Assume that $\{\mathbf{v}_1, \ldots, \mathbf{v}_n\}$ is linearly independent. If a vector $\mathbf{v}$ can be represented as a linear combination of these vectors, i.e., $\mathbf{v} = x_1\mathbf{v}_1 + \cdots + x_n\mathbf{v}_n$, the coefficients $x_i$'s are unique.

Proof: Suppose that $\mathbf{v} = y_1\mathbf{v}_1 + \cdots + y_n\mathbf{v}_n$ also holds for some scalars $y_i$'s. If we subtract the latter from the former, we get

$$(x_1 - y_1)\mathbf{v}_1 + \cdots + (x_n - y_n)\mathbf{v}_n = \mathbf{0}.$$

Because of the linear independence, $x_i - y_i = 0$, or equivalently $x_i = y_i$, for all $i = 1, \ldots, n$. ∎


We can further observe that any vector outside a vector space spanned by linearly independent vectors 


is linearly independent of the spanning vectors. 


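Fact 3.5 If $\{\mathbf{v}_1, \ldots, \mathbf{v}_n\}$ is linearly independent and $\mathbf{v} \notin \mathrm{span}\{\mathbf{v}_1, \ldots, \mathbf{v}_n\}$, then $\{\mathbf{v}_1, \ldots, \mathbf{v}_n, \mathbf{v}\}$ is also linearly independent.

Proof: Notice that $\mathbf{v} \neq \mathbf{0}$. Suppose that $c_1\mathbf{v}_1 + \cdots + c_n\mathbf{v}_n + c_{n+1}\mathbf{v} = \mathbf{0}$ holds for some scalars $c_i$'s. If $c_{n+1} \neq 0$, then we have

$$\mathbf{v} = -\frac{c_1}{c_{n+1}}\mathbf{v}_1 - \cdots - \frac{c_n}{c_{n+1}}\mathbf{v}_n \in \mathrm{span}\{\mathbf{v}_1, \ldots, \mathbf{v}_n\},$$

which contradicts the assumption. Hence $c_{n+1} = 0$, and the linear independence of the spanning vectors implies $c_1 = \cdots = c_n = 0$. ∎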


Basis for a Vector Space 
The independence property is about the minimality, and the spanning property is about the sufficiency. 
Then a natural question is on minimally sufficient vectors to span a space. 

Definition 3.6 A basis for a vector space V is a set of vectors satisfying both of the following 


properties. 


1. (independence) The vectors in the set are linearly independent; 


2. (spanning) The vectors in the set span the space V. 


A vector in a basis is called a basic vector. 


If linearly dependent vectors span a vector space, we can always find a smaller set of vectors that spans the same vector space. Conversely, if some linearly independent vectors are not enough to span a target vector space, we can incrementally add a linearly independent vector until they are sufficient to span the target space, thanks to Fact 3.5.


• Because there is a unique linear combination to represent an arbitrary vector in a vector space using basic vectors, we can treat the coefficients of the linear combination as coordinates. That is, given a basis $\{\mathbf{v}_1, \ldots, \mathbf{v}_n\}$, we can represent an arbitrary vector $\mathbf{v} \in \mathbb{V}$ uniquely as

$$\mathbf{v} = x_1\mathbf{v}_1 + \cdots + x_n\mathbf{v}_n \qquad (3.2)$$

where $x_i$'s are coefficients of $\mathbf{v}$ with respect to the basis $\{\mathbf{v}_1, \ldots, \mathbf{v}_n\}$. So, we may conveniently regard $(x_1, \ldots, x_n)^\top \in \mathbb{R}^n$ as $\mathbf{v}$.

For this correspondence to hold, it should be one-to-one. We see that this is true from the spanning property, which states that any arbitrary vector can be represented as a linear combination of basic vectors, and the independence property, which states that such representation is unique. In summary, when $\{\mathbf{v}_1, \ldots, \mathbf{v}_n\}$ is linearly independent,

$$\mathbf{v} \in \mathrm{span}\{\mathbf{v}_1, \ldots, \mathbf{v}_n\} \ \overset{\text{1-to-1}}{\longleftrightarrow}\ (x_1, \ldots, x_n) \in \mathbb{R}^n. \qquad (3.3)$$
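• There can be many bases for a vector space. When $\{\mathbf{v}_1, \ldots, \mathbf{v}_n\} \subset \mathbb{R}^m$ is a basis of a vector space, let $B = [\mathbf{v}_1 | \cdots | \mathbf{v}_n]$. For any arbitrary invertible matrix $P$, we obtain a new basis for the vector space by taking the column vectors of $BP$. For instance, if we multiply $B = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$, whose column vectors span $\mathbb{R}^2$, i.e., $\mathbb{R}^2 = \mathrm{span}\left\{\begin{bmatrix} 1 \\ 0 \end{bmatrix}, \begin{bmatrix} 0 \\ 1 \end{bmatrix}\right\}$, with an invertible matrix $P = \begin{bmatrix} 1 & -1 \\ 1 & 1 \end{bmatrix}$, we get $BP = P$, implying $\mathbb{R}^2 = \mathrm{span}\left\{\begin{bmatrix} 1 \\ 1 \end{bmatrix}, \begin{bmatrix} -1 \\ 1 \end{bmatrix}\right\}$.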
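As a small worked illustration of coordinates with respect to a non-standard basis of $\mathbb{R}^2$, consider the basis formed by the columns of $P = \begin{bmatrix} 1 & -1 \\ 1 & 1 \end{bmatrix}$. The vector $(3, 1)^\top$ below is an arbitrary choice made for this sketch; its coordinates are found by solving a 2 × 2 linear system.

```latex
\begin{aligned}
\begin{bmatrix} 3 \\ 1 \end{bmatrix}
  = x_1 \begin{bmatrix} 1 \\ 1 \end{bmatrix}
  + x_2 \begin{bmatrix} -1 \\ 1 \end{bmatrix}
  &\iff
  \begin{cases} x_1 - x_2 = 3 \\ x_1 + x_2 = 1 \end{cases} \\
  &\iff (x_1, x_2) = (2, -1).
\end{aligned}
```

The same vector therefore has coordinates $(3, 1)$ in the standard basis and $(2, -1)$ in this basis, and by the uniqueness of the representation no other pair of coefficients works.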


Dimension of a Vector Space 


Although there are many bases for a vector space, the number of basic vectors within each basis remains 


identical, according to the following theorem. 
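Theorem 3.1 If both $\{\mathbf{v}_1, \ldots, \mathbf{v}_n\}$ and $\{\mathbf{w}_1, \ldots, \mathbf{w}_m\}$ are bases for a vector space $\mathbb{V}$, then $n = m$.

Proof: Since both sets of vectors span $\mathbb{V}$, there exist $a_{ij}$'s such that

$$\mathbf{v}_j = a_{1j}\mathbf{w}_1 + \cdots + a_{mj}\mathbf{w}_m = \sum_{i=1}^{m} a_{ij}\mathbf{w}_i \quad \text{for all } j = 1, \ldots, n.$$

Let us set an $m \times n$ matrix $A = (a_{ij})$. Assume $A\mathbf{x}^* = \mathbf{0}$ for some vector $\mathbf{x}^* \in \mathbb{R}^n$. Then, $\sum_{j=1}^{n} a_{ij}x_j^* = 0$ for all $i = 1, \ldots, m$. For this $\mathbf{x}^*$,

$$\sum_{j=1}^{n} x_j^*\mathbf{v}_j = \sum_{j=1}^{n} x_j^* \Big(\sum_{i=1}^{m} a_{ij}\mathbf{w}_i\Big) = \sum_{j=1}^{n}\sum_{i=1}^{m} x_j^* a_{ij}\mathbf{w}_i = \sum_{i=1}^{m}\sum_{j=1}^{n} x_j^* a_{ij}\mathbf{w}_i = \sum_{i=1}^{m} \Big(\sum_{j=1}^{n} a_{ij}x_j^*\Big)\mathbf{w}_i = \mathbf{0},$$

which implies $x_j^* = 0$ for all $j$ since $\mathbf{v}_j$'s are linearly independent, that is, $\mathbf{x}^* = \mathbf{0}$. Therefore, $m \ge n$ by Lemma 3.2. If we change the roles of $\mathbf{v}_j$ and $\mathbf{w}_i$, then we have $n \ge m$. ∎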


This allows us to use the number of basic vectors in a basis to quantify the size of a vector space. We call this number the dimension of the vector space $\mathbb{V}$ and write it as $\dim \mathbb{V}$.

We encourage you to think of the following properties and why they hold.
• $\dim(\mathbb{R}^n) = n$.

• $(k + 1)$ vectors in a $k$-dimensional vector space are linearly dependent.

• Any spanning set of vectors can be reduced to a basis, i.e., a minimal spanning set.

A few observations follow.


Lemma 3.3 In a finite-dimensional vector space, any linearly independent set of vectors can be extended 


to a basis. 


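Proof: Consider $k$ linearly independent vectors $\{\mathbf{v}_1, \ldots, \mathbf{v}_k\} \subset \mathbb{V}$, where $\mathbb{V}$ is finite-dimensional. Let $k < \dim \mathbb{V} < \infty$. If $\mathbb{V} = \mathrm{span}\{\mathbf{v}_1, \ldots, \mathbf{v}_k\}$, there is a contradiction, as it would give $\dim \mathbb{V} = k$. Thus, there exists $\mathbf{v} \in \mathbb{V}$ that satisfies $\mathbf{v} \notin \mathrm{span}\{\mathbf{v}_1, \ldots, \mathbf{v}_k\}$. According to Fact 3.5, we can obtain $(k + 1)$ linearly independent vectors that include $\{\mathbf{v}_1, \ldots, \mathbf{v}_k\}$. We repeat this process of adding one vector at a time, until we obtain a basis of $\mathbb{V}$ that contains all the initial linearly independent vectors. ∎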


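Fact 3.6 Let $\mathbb{V}$ be a finite-dimensional vector space. $\mathbb{W}_1$ and $\mathbb{W}_2$ are two subspaces of $\mathbb{V}$. Suppose that $\dim \mathbb{W}_1 + \dim \mathbb{W}_2 > \dim \mathbb{V}$. Then $\dim(\mathbb{W}_1 \cap \mathbb{W}_2) \ge \dim \mathbb{W}_1 + \dim \mathbb{W}_2 - \dim \mathbb{V}$.

Proof: Denote $\dim \mathbb{V} = n$, $\dim \mathbb{W}_1 = n_1$, and $\dim \mathbb{W}_2 = n_2$. Let $\dim(\mathbb{W}_1 \cap \mathbb{W}_2) = k$ and $\mathcal{B} = \{\mathbf{v}_1, \ldots, \mathbf{v}_k\}$ be a basis of $\mathbb{W}_1 \cap \mathbb{W}_2$. Because $\mathcal{B}$ is linearly independent, we can extend it to bases of $\mathbb{W}_1$ and $\mathbb{W}_2$, respectively, according to Lemma 3.3. We use $\mathcal{B} \cup \{\mathbf{w}_{k+1}, \ldots, \mathbf{w}_{n_1}\}$ and $\mathcal{B} \cup \{\mathbf{u}_{k+1}, \ldots, \mathbf{u}_{n_2}\}$ to denote their bases.

• Let $\mathbf{u} = z_{k+1}\mathbf{u}_{k+1} + \cdots + z_{n_2}\mathbf{u}_{n_2} \in \mathbb{W}_1 \cap \mathbb{W}_2$. Since $\mathcal{B}$ is a basis of $\mathbb{W}_1 \cap \mathbb{W}_2$,

$$\mathbf{u} = x_1\mathbf{v}_1 + \cdots + x_k\mathbf{v}_k, \quad \text{or} \quad z_{k+1}\mathbf{u}_{k+1} + \cdots + z_{n_2}\mathbf{u}_{n_2} - x_1\mathbf{v}_1 - \cdots - x_k\mathbf{v}_k = \mathbf{0}.$$

Because $\mathcal{B} \cup \{\mathbf{u}_{k+1}, \ldots, \mathbf{u}_{n_2}\}$ is a basis, $x_1 = \cdots = x_k = z_{k+1} = \cdots = z_{n_2} = 0$, and $\mathbf{u} = \mathbf{0}$.

• Consider the following zero-vector

$$\underbrace{x_1\mathbf{v}_1 + \cdots + x_k\mathbf{v}_k}_{=\mathbf{v}} + \underbrace{y_{k+1}\mathbf{w}_{k+1} + \cdots + y_{n_1}\mathbf{w}_{n_1}}_{=\mathbf{w}} + \underbrace{z_{k+1}\mathbf{u}_{k+1} + \cdots + z_{n_2}\mathbf{u}_{n_2}}_{=\mathbf{u}} = \mathbf{0}.$$

By re-arranging the terms, we get $\mathbf{u} = -(\mathbf{v} + \mathbf{w}) \in \mathbb{W}_1$, which implies $\mathbf{u} \in \mathbb{W}_1 \cap \mathbb{W}_2$. As we have shown above, $\mathbf{u} = \mathbf{0}$, which is equivalent to $z_{k+1} = \cdots = z_{n_2} = 0$. Together with the fact that $\mathcal{B} \cup \{\mathbf{w}_{k+1}, \ldots, \mathbf{w}_{n_1}\}$ is a basis, $x_1 = \cdots = x_k = y_{k+1} = \cdots = y_{n_1} = 0$, since $\mathbf{v} + \mathbf{w} = \mathbf{0}$. In other words, $\{\mathbf{v}_1, \ldots, \mathbf{v}_k, \mathbf{w}_{k+1}, \ldots, \mathbf{w}_{n_1}, \mathbf{u}_{k+1}, \ldots, \mathbf{u}_{n_2}\}$ consists of $(n_1 + n_2 - k)$ linearly independent vectors, which implies $n_1 + n_2 - k \le n$.

Therefore, $\dim(\mathbb{W}_1 \cap \mathbb{W}_2) \ge n_1 + n_2 - n$. ∎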


3.4 Rank of a Matrix 


How many linearly independent columns do we get in a matrix A with r pivot elements? To answer this 
question, let us start with a row echelon form U of the matrix A. U has r pivots, and assume that there 


are at most M linearly independent columns in U. 


• Let $\{\mathbf{u}_1, \ldots, \mathbf{u}_r\}$ be the $r$ columns of $U$ that contain its pivots. For convenience, we let the pivot of $\mathbf{u}_i$ be the $i$-th entry, say $p_i$. The $r$-th equation in the linear system, $x_1\mathbf{u}_1 + \cdots + x_r\mathbf{u}_r = \mathbf{0}$, is $0x_1 + \cdots + 0x_{r-1} + p_r x_r = 0$. Because the pivot $p_r$ of $\mathbf{u}_r$ is not 0, $x_r$ vanishes, which allows us to shorten the equation into $x_1\mathbf{u}_1 + \cdots + x_{r-1}\mathbf{u}_{r-1} = \mathbf{0}$ in $r - 1$ unknowns. By repeating this process, we end up with $x_1 = \cdots = x_r = 0$, and therefore $\{\mathbf{u}_1, \ldots, \mathbf{u}_r\}$ is linearly independent. That is, $M \ge r$.

• Let $\{\mathbf{u}_1, \ldots, \mathbf{u}_k\}$ be a set of $k$ arbitrarily selected columns of $U$, with $k > r$. Since the $r$ pivots lie in the first $r$ rows of $U$, all $m - r$ elements below in each $\mathbf{u}_i$ are zeros. With this, we see that $z_1\mathbf{u}_1 + \cdots + z_k\mathbf{u}_k = \mathbf{0}$, a linear system of $m$ equations, is in fact a system of $r$ linear equations in $k$ variables. There are thus solutions that are not $\mathbf{0}$, according to Lemma 3.2, which implies that more than $r$ columns of $U$ cannot be linearly independent. Thus, $M \le r$ holds.

Therefore, $M = r$, and the maximum number of linearly independent columns in $U$, a row echelon form of $A$, is $r$.

Let us continue with the original matrix $A$. Because $Q$ and $L$, in $L^{-1}QA = U$, are both invertible, the following equivalence holds

$$A\mathbf{x} = \mathbf{0} \iff U\mathbf{x} = \mathbf{0}.$$

According to this equivalence, any linear relationship among the column vectors of $A$, and hence whether they are linearly independent, holds in the same way among the corresponding column vectors of $U$. From this, we now know that the maximum number of linearly independent columns of a matrix coincides with the number of pivots of the same matrix. We call this number the rank of a matrix.


Definition 3.8 (Rank of a Matrix) Let U be the row echelon form of a matrix A. If U has r 


pivots, then the rank of A is r. We denote it as rank A. 


Because both $U$, the row echelon form of $A$, and $R$, the reduced row echelon form of $A$, are row echelon forms of themselves, respectively, the ranks of $A$, $U$ and $R$ are the same.

We turn our attention to the maximum number of linearly independent rows of A and its relation to 
the rank. Because A and QA are equivalent up to the ordering of the rows, without loss of generality, 
it is enough to consider the case of Q = I. That is, we consider the case of A = LU. Let rank A =r, 
i.e., there are r pivots in A. It is clear that the r rows on the top of the row echelon form U are linearly 
independent. The rest of the rows are all zeros, and there can be at most r linearly independent rows in 
U. 

Let us create $\hat{A}$ and $\hat{U}$ from $A$ and $U$, respectively, by collecting only the first $r$ rows of these matrices. We also create an invertible lower-triangular matrix $L_r$ from $L$ by collecting the first $r$ rows and $r$ columns. These matrices are related to each other by $\hat{A} = L_r\hat{U}$.⁴ Since the rows of $\hat{U}$ are linearly independent, $\mathbf{y}^\top\hat{A} = \mathbf{y}^\top L_r\hat{U} = \mathbf{0}^\top$ implies $\mathbf{y}^\top L_r = \mathbf{0}^\top$. Because $L_r$ is invertible, $\mathbf{y} = \mathbf{0}$. In short, there are at least $r$ linearly independent rows in $A$, because the first $r$ rows of $A$ are linearly independent.

⁴ This holds because the $i$-th row of $LU$ is a linear combination of the first $i$ rows of $U$ when $L$ is a lower-triangular matrix with a unit diagonal.

Let us consider the other direction. We can see that $A = \bar{L}\hat{U}$ where $\bar{L}$ is created by collecting the first $r$ columns of $L$, because the last $(m - r)$ rows of $U$ are all zeros. With $B = \bar{L}L_r^{-1}$, we get

$$A = \bar{L}\hat{U} = \bar{L}L_r^{-1}\hat{A} = B\hat{A}.$$

When $k > r$, $A' = B'\hat{A}$ holds for $A'$, constructed by selecting $k$ arbitrary rows of $A$, and $B'$, constructed by selecting the $k$ corresponding rows from $B$. Because $B'$ is a $k \times r$ matrix, there exists $\mathbf{y} \neq \mathbf{0}$ that satisfies $\mathbf{y}^\top B' = \mathbf{0}^\top$ according to Lemma 3.2, and thereby $\mathbf{y}^\top A' = \mathbf{y}^\top B'\hat{A} = \mathbf{0}^\top\hat{A} = \mathbf{0}^\top$, implying that the rows in $A'$ are not linearly independent. There are therefore at most $r = \mathrm{rank}\,A$ linearly independent rows in $A$.


Putting these two parts together, we conclude 


$\mathrm{rank}\,A$ is equal to the maximum number of linearly independent rows or columns of $A$.

We can further conclude the following equality:

$$\mathrm{rank}\,A = \mathrm{rank}\,A^\top$$


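Think of a subspace $\mathrm{Col}(A)$ spanned by the columns of a rank-$r$ $m \times n$ matrix $A = [\mathbf{a}_1 | \cdots | \mathbf{a}_n]$, with the first $r$ column vectors being linearly independent. With $k > r$, $\{\mathbf{a}_1, \ldots, \mathbf{a}_r, \mathbf{a}_k\}$ is not linearly independent, which means there exist $x_i$'s that satisfy $x_1\mathbf{a}_1 + \cdots + x_r\mathbf{a}_r + x_k\mathbf{a}_k = \mathbf{0}$ with $x_k \neq 0$. Then, $\mathrm{Col}(A) \subset \mathrm{span}\{\mathbf{a}_1, \ldots, \mathbf{a}_r\}$, because

$$\mathbf{a}_k = -\frac{x_1}{x_k}\mathbf{a}_1 - \cdots - \frac{x_r}{x_k}\mathbf{a}_r \in \mathrm{span}\{\mathbf{a}_1, \ldots, \mathbf{a}_r\}.$$

That is, $\mathrm{Col}(A) = \mathrm{span}\{\mathbf{a}_1, \ldots, \mathbf{a}_r\}$, and $\dim \mathrm{Col}(A) = \mathrm{rank}\,A$. In summary,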


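Lemma 3.4 For $\mathbf{a}_i \in \mathbb{R}^m$,

$$\dim \mathrm{span}\{\mathbf{a}_1, \ldots, \mathbf{a}_n\} = \mathrm{rank}\,[\mathbf{a}_1 | \cdots | \mathbf{a}_n]$$

or, in a matrix form,

$$\dim \mathrm{Col}(A) = \mathrm{rank}\,A$$

for any matrix $A$.

This applies equally to the vector space spanned by the rows of $A$, as $\dim \mathrm{Col}(A^\top) = \mathrm{rank}\,A^\top = \mathrm{rank}\,A$. Also, the multiplication of matrices does not increase the rank of a product.

Fact 3.7 Suppose $A$ and $B$ are $n \times m$ and $m \times n$ matrices, respectively. Then, $\mathrm{rank}(AB) \le \mathrm{rank}(A)$ and $\mathrm{rank}(AB) \le \mathrm{rank}(B)$.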


Proof: According to Lemma 3.1, $\mathrm{Col}(AB) \subset \mathrm{Col}(A)$. When two vector spaces, $\mathbb{W}_1$ and $\mathbb{W}_2$, satisfy $\mathbb{W}_1 \subset \mathbb{W}_2$, we can establish the relationship between their dimensions as $\dim \mathbb{W}_1 \le \dim \mathbb{W}_2$. It then follows that $\mathrm{rank}(AB) \le \mathrm{rank}\,A$ due to Lemma 3.4. The second inequality follows from this, because $\mathrm{rank}(AB) = \mathrm{rank}((AB)^\top) = \mathrm{rank}(B^\top A^\top) \le \mathrm{rank}(B^\top) = \mathrm{rank}(B)$. ∎
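Both rank statements of this section, $\mathrm{rank}\,A = \mathrm{rank}\,A^\top$ and Fact 3.7, can be spot-checked by counting pivots. The sketch below is only an illustration on one pair of matrices: `rank`, `transpose` and `matmul` are helper names written for it, and $B$ is deliberately chosen so that its columns lie in $\mathrm{Null}(A)$, which makes the rank of the product drop all the way to 0.

```python
from fractions import Fraction

def rank(rows):
    """Rank as the number of pivots found by Gaussian elimination."""
    M = [[Fraction(x) for x in row] for row in rows]
    m, n = len(M), len(M[0])
    r = 0
    for c in range(n):
        if r == m:
            break
        p = next((i for i in range(r, m) if M[i][c] != 0), None)
        if p is None:
            continue
        M[r], M[p] = M[p], M[r]
        for i in range(r + 1, m):
            factor = M[i][c] / M[r][c]
            M[i] = [a - factor * b for a, b in zip(M[i], M[r])]
        r += 1
    return r

def transpose(M):
    return [list(col) for col in zip(*M)]

def matmul(X, Y):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

A = [[1, 3, 3, 2],
     [2, 6, 9, 7],
     [-1, -3, 3, 4]]                 # 3 x 4, rank 2
B = [[-3, 1],
     [1, 0],
     [0, -1],
     [0, 1]]                         # 4 x 2, columns chosen inside Null(A)

ranks = (rank(A), rank(transpose(A)), rank(B), rank(matmul(A, B)))
print(ranks)
```

Here the product has rank 0 while both factors have rank 2, so the inequalities in Fact 3.7 can be strict.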


3.5 The Four Fundamental Subspaces 


We introduced two subspaces related to a matrix, the column space and the null space. If we consider 
these subspaces for the transpose of a given matrix, we have the following four subspaces related to a 


rank-r m x n matrix A: 
1. Column space $\mathrm{Col}(A) \subset \mathbb{R}^m$. $\dim \mathrm{Col}(A) = r$;
2. Null space $\mathrm{Null}(A) \subset \mathbb{R}^n$. $\dim \mathrm{Null}(A) = n - r$;
3. Row space $\mathrm{Row}(A) = \mathrm{Col}(A^\top) \subset \mathbb{R}^n$. $\dim \mathrm{Row}(A) = r$;
4. Left null space $\mathrm{LeftNull}(A) = \mathrm{Null}(A^\top) \subset \mathbb{R}^m$. $\dim \mathrm{LeftNull}(A) = m - r$.

We already characterized the dimensions of the first and third subspaces in Lemma 3.4. Let us take a look at the dimension of null spaces. From Gaussian elimination, we showed $A\mathbf{x} = \mathbf{0} \iff U\mathbf{x} = \mathbf{0}$, which implies that $\mathrm{Null}(A) = \mathrm{Null}(U)$. By assigning 1 to one free variable in $U\mathbf{x} = \mathbf{0}$ and 0 to all the other free variables, we get as many linearly independent vectors in the null space as there are free variables, and they form a basis of the null space. Therefore, $\dim \mathrm{Null}(A) = n - r$, because $\dim \mathrm{Null}(U)$ coincides with the number of free variables of $U$, which is $n - r$. From these, we arrive at the rank-nullity theorem:

$$\mathrm{rank}(A) + \dim \mathrm{Null}(A) = \dim \mathrm{Col}(A) + \dim \mathrm{Null}(A) = \text{the number of columns of } A \qquad (3.4)$$


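Example 3.1 Let us find the four fundamental subspaces of $A = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix} = U = R$. The four subspaces can be written down for this simple matrix, as follows.

1. Column space: $\mathrm{Col}(A) = \mathrm{span}\left\{\begin{bmatrix} 1 \\ 0 \end{bmatrix}\right\}$

2. Null space: $\mathrm{Null}(A) = \mathrm{span}\left\{\begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix}, \begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix}\right\}$

3. Row space: $\mathrm{Row}(A) = \mathrm{Col}(A^\top) = \mathrm{span}\left\{\begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix}\right\}$

4. Left null space: $\mathrm{LeftNull}(A) = \mathrm{Null}(A^\top) = \mathrm{span}\left\{\begin{bmatrix} 0 \\ 1 \end{bmatrix}\right\}$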


An Example of Constructing Fundamental Spaces of A from Those of U 


It is often more straightforward to identify the four fundamental subspaces of $A$ from its row echelon form $U$. We study how we can do so in the following example. Let $A = \begin{bmatrix} 1 & 3 & 3 & 2 \\ 2 & 6 & 9 & 7 \\ -1 & -3 & 3 & 4 \end{bmatrix}$. Then, its row echelon form is $U = \begin{bmatrix} 1 & 3 & 3 & 2 \\ 0 & 0 & 3 & 3 \\ 0 & 0 & 0 & 0 \end{bmatrix}$.

1. Finding the row space $\mathrm{Col}(A^\top)$: The basis of $\mathrm{Col}(A^\top)$ consists of the non-zero rows of $U$, that is, $\mathrm{Col}(A^\top) = \mathrm{Col}(U^\top)$. This follows from Lemma 3.1 and the fact that $L$ is invertible and $A^\top = U^\top L^\top$.

2. Finding the column space $\mathrm{Col}(A)$: A basis of $\mathrm{Col}(U)$ consists of the columns that contain pivot elements in $U$. Because $A\mathbf{x} = \mathbf{0} \iff U\mathbf{x} = \mathbf{0}$, linear independence of some columns of $A$ is equivalent to linear independence of the corresponding columns of $U$. From this, we find that $\dim \mathrm{Col}(A) = \dim \mathrm{Col}(U) = \mathrm{rank}\,A = r$. In other words, we can form a basis of $\mathrm{Col}(A)$ by collecting as many linearly independent columns of $A$ as there are pivots in $U$. As the columns that contain pivots in $U$ are linearly independent, the corresponding columns in $A$ are also linearly independent. That is, these columns form a basis of $\mathrm{Col}(A)$. In the example above, they are the first and third columns of $A$.

3. Finding the null space $\mathrm{Null}(A)$: Because $A\mathbf{x} = \mathbf{0} \iff U\mathbf{x} = \mathbf{0}$ from Gaussian elimination, $\mathrm{Null}(A) = \mathrm{Null}(U)$. The number of free variables in $U$, which is in turn $\dim \mathrm{Null}(U)$, is $n - r$. By assigning 1 to one free variable in $U\mathbf{x} = \mathbf{0}$ and 0 to all the other free variables, we obtain as many linearly independent vectors as free variables that span the null space, and they form its basis.


How to Find a Basis of a Spanned Subspace 


One way to find a basis of a vector space spanned by a set of vectors is to stack those vectors row-wise to 
construct a matrix, perform Gaussian elimination on this matrix and collect all non-zero rows in the row 
echelon form as a basis. We can also use all pivot columns of a row echelon form of a matrix constructed 
by horizontally stacking given vectors. Unlike the first method, it is important to notice that we collect 


the columns of A corresponding to the pivots of U to form a basis. 
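The second method can be sketched as follows: place the given vectors into the columns of a matrix, locate the pivot columns by elimination, and return the original vectors at those positions. This is a minimal illustration with exact fractions; `pivot_columns` and `basis_of_span` are hypothetical helper names, and the four input vectors are the columns of the 3 × 4 example matrix used earlier in this chapter.

```python
from fractions import Fraction

def pivot_columns(rows):
    """Column indices of the pivots found by Gaussian elimination."""
    M = [[Fraction(x) for x in row] for row in rows]
    m, n = len(M), len(M[0])
    pivots = []
    r = 0
    for c in range(n):
        if r == m:
            break
        p = next((i for i in range(r, m) if M[i][c] != 0), None)
        if p is None:
            continue
        M[r], M[p] = M[p], M[r]
        for i in range(r + 1, m):
            factor = M[i][c] / M[r][c]
            M[i] = [a - factor * b for a, b in zip(M[i], M[r])]
        pivots.append(c)
        r += 1
    return pivots

def basis_of_span(vectors):
    """Pick a basis of span(vectors) from the given vectors themselves."""
    A = [list(row) for row in zip(*vectors)]        # vectors become the columns of A
    return [vectors[c] for c in pivot_columns(A)]   # original vectors, not columns of U

vectors = [[1, 2, -1], [3, 6, -3], [3, 9, 3], [2, 7, 4]]
basis = basis_of_span(vectors)
print(basis)
```

The result consists of the first and third input vectors. Note that the routine returns entries of `vectors`, not columns of the eliminated matrix, because row operations change the column space.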




3.6 Existence of an Inverse 


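Consider a rank-$m$, $m \times n$ matrix $A$ which can be written down as $A = LU$ with $m < n$. In this case, there are $m$ pivot columns. We choose an $n \times n$ permutation matrix $Q$ such that we can use the first $m$ columns of $UQ$ to construct an invertible submatrix $\hat{U}$. If it is not necessary to swap columns, take $Q = I_n$. For instance, we start with the row echelon form $L^{-1}A = U$ and move its $m$ pivot columns to the front,

$$L^{-1}AQ = UQ = [\hat{U} \mid G],$$

where $\hat{U}$ is an $m \times m$ upper triangular matrix with the pivots on its diagonal and $G$ collects the remaining $n - m$ columns.

Consider any $(n - m) \times m$ matrix $H$ that satisfies $GH = 0$. Then, with $C = Q\begin{bmatrix} \hat{U}^{-1} \\ H \end{bmatrix}L^{-1}$, it holds that

$$AC = (AQ)(Q^{-1}C) = (LUQ)(Q^{-1}C) = L\,[\hat{U} \mid G]\begin{bmatrix} \hat{U}^{-1} \\ H \end{bmatrix}L^{-1} = L\,I_m\,L^{-1} = I_m$$

because

$$[\hat{U} \mid G]\begin{bmatrix} \hat{U}^{-1} \\ H \end{bmatrix} = \hat{U}\hat{U}^{-1} + GH = I_m.$$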


Therefore, $C$ is a right-inverse of $A$. Note that $C$ depends on $H$ and there may be many $H$'s, including $H = 0$, that satisfy $GH = 0$. In other words, there can be many right inverses of $A$.

If we had to swap rows in the process of Gaussian elimination and can use a permutation matrix $Q'$ to represent all row swaps, we use $Q'A$ instead of $A$ in the derivation above. The right inverse $C$ above then satisfies $Q'AC = I_m$ and thereby $AC = Q'^{-1}$, resulting in $CQ'$ being a right inverse of $A$.

On the other hand, there is no left inverse of $A$. If there were, there would exist $B$ such that $BA = I_n$. Because there are $m$ columns in $B$, $\mathrm{rank}(B) \le m$ and subsequently $\mathrm{rank}(BA) \le \mathrm{rank}(B) \le m$. This is contradictory, since $\mathrm{rank}(I_n) = n > m$.


Here are some additional properties that are useful: 


• Similarly to the case above, when $\mathrm{rank}(A) = n < m$, left-inverses exist but there is no right inverse.

• Consider a case where $\mathrm{rank}(A) = m = n$. This parallels the case of $\mathrm{rank}\,A = m < n$ above, however without $G$ and $H$. In that case, the right inverse of $A$ is $C = Q\hat{U}^{-1}L^{-1}$. Because $L^{-1}AQ = \hat{U}$,

$$CA = Q\hat{U}^{-1}L^{-1}A = Q\hat{U}^{-1}(L^{-1}AQ)Q^{-1} = Q\hat{U}^{-1}\hat{U}Q^{-1} = I_m,$$

meaning that $C$ is also the left inverse. That is, $C$ is the inverse of $A$.


These cases can be summarized into the following theorem. 


Theorem 3.2 For an $m \times n$ matrix $A$, the inverse of $A$ exists only when $\operatorname{rank}(A) = m = n$. If $\operatorname{rank}(A) = m < n$, a right inverse of $A$ exists. If $\operatorname{rank}(A) = n < m$, a left inverse of $A$ exists.


The following relationships hold between right and left inverses of $A$ and the existence of solutions of $A\mathbf{x} = \mathbf{b}$. For any $m \times n$ matrix $A$,

• $\operatorname{rank}(A) = m$ implies the existence of a solution of $A\mathbf{x} = \mathbf{b}$: $\operatorname{Col}(A) = \mathbb{R}^m$ because the row and column ranks coincide, and there exists at least one solution of $A\mathbf{x} = \mathbf{b}$ for any $\mathbf{b}$.

• $\operatorname{rank}(A) = n$ implies the uniqueness of the solution of $A\mathbf{x} = \mathbf{b}$ whenever one exists: because $\dim \operatorname{Col}(A) = n$, the columns of $A$ are linearly independent. Therefore, if it exists, the solution of $A\mathbf{x} = \mathbf{b}$ is unique.


The rank of a matrix is upper-bounded by the smaller of the numbers of rows and columns. When 


the rank of a matrix is maximal, we obtain the following additional properties. 


Fact 3.8 Let $A$ be an $m \times n$ matrix.

1. $\operatorname{rank} A = n$ case: Then the rank of $A^\top A$ is also $n$, and $A^\top A$ is invertible. $(A^\top A)^{-1}A^\top$ is a left inverse of $A$ and $A(A^\top A)^{-1}$ is a right inverse of $A^\top$.

2. $\operatorname{rank} A = m$ case: Then the rank of $AA^\top$ is also $m$, and $AA^\top$ is invertible. $A^\top(AA^\top)^{-1}$ is a right inverse of $A$ and $(AA^\top)^{-1}A$ is a left inverse of $A^\top$.
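Proof: For the case $\operatorname{rank} A = n$, it is enough to show that the $n \times n$ matrix $A^\top A$ has the trivial null space $\{\mathbf{0}\}$. Assume $A^\top A\mathbf{x} = \mathbf{0}$. Multiplying both sides of the equation by $\mathbf{x}^\top$ from the left, we get $\mathbf{x}^\top A^\top A\mathbf{x} = 0$. If we denote $\mathbf{y} = A\mathbf{x}$, then $\mathbf{x}^\top A^\top A\mathbf{x} = \mathbf{y}^\top\mathbf{y} = 0$. Since $\mathbf{y}^\top\mathbf{y} = \sum_i y_i^2 = 0$, each $y_i = 0$ and $\mathbf{y} = \mathbf{0}$, that is, $A\mathbf{x} = \mathbf{0}$. Hence $\mathbf{x} \in \operatorname{Null}(A)$. However, the dimension balance (3.4) implies $\dim \operatorname{Null}(A) = 0$ and $\operatorname{Null}(A) = \{\mathbf{0}\}$, so $\mathbf{x} = \mathbf{0}$. Therefore $\operatorname{rank} A^\top A = n$ and $A^\top A$ is invertible. It is clear that $(A^\top A)^{-1}A^\top$ is a left inverse of $A$ and $A(A^\top A)^{-1}$ is a right inverse of $A^\top$.

For the other case, take $B = A^\top$ and apply the first result to $B$. ∎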
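As a quick numerical illustration of Fact 3.8, the following NumPy sketch builds a tall matrix with full column rank and checks both formulas. The matrix size and the random seed are arbitrary choices made for this illustration, not values from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))  # tall matrix; full column rank for this draw

# Left inverse (A^T A)^{-1} A^T, computed by solving instead of inverting.
left_inv = np.linalg.solve(A.T @ A, A.T)
print(np.allclose(left_inv @ A, np.eye(3)))

# Second case of the fact, applied to B = A^T (full row rank).
B = A.T
right_inv = B.T @ np.linalg.inv(B @ B.T)
print(np.allclose(B @ right_inv, np.eye(3)))
```

Solving the linear system with `np.linalg.solve` avoids forming $(A^\top A)^{-1}$ explicitly, which is usually the more stable choice in floating point.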


Consider the following simple example for finding a right inverse and checking its relationships to a 
left inverse. 




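Example 3.2 Let $A = \begin{bmatrix} 4 & 0 & 0 \\ 0 & 5 & 0 \end{bmatrix}$. Given arbitrary $c_{31}$ and $c_{32}$, $AC = I_2$, where $C = \begin{bmatrix} 1/4 & 0 \\ 0 & 1/5 \\ c_{31} & c_{32} \end{bmatrix}$.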
That is, C is a right inverse of A. We already showed earlier that such a matrix does not admit a left 


inverse. If we multiply A with C from left, we get 


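$$CA = \begin{bmatrix} 1/4 & 0 \\ 0 & 1/5 \\ c_{31} & c_{32} \end{bmatrix} \begin{bmatrix} 4 & 0 & 0 \\ 0 & 5 & 0 \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 4c_{31} & 5c_{32} & 0 \end{bmatrix}.$$

Regardless of $c_{31}$ and $c_{32}$, $C$ is not a left inverse.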


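Because $\operatorname{rank} A = 2$, we can derive a right inverse of $A$ according to Fact 3.8, as follows:

$$C^* = A^\top (AA^\top)^{-1} = \begin{bmatrix} 4 & 0 \\ 0 & 5 \\ 0 & 0 \end{bmatrix} \begin{bmatrix} 1/16 & 0 \\ 0 & 1/25 \end{bmatrix} = \begin{bmatrix} 1/4 & 0 \\ 0 & 1/5 \\ 0 & 0 \end{bmatrix}.$$

In this particular right inverse $C^*$, $c_{31} = c_{32} = 0$, and we refer to this right inverse as the pseudo-inverse of $A$, as we will learn later in Section 5.8.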
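The computations in Example 3.2 can be checked numerically. The sketch below uses NumPy; the helper name `right_inverse` and the sample values for $c_{31}$ and $c_{32}$ are hypothetical choices made for illustration.

```python
import numpy as np

A = np.array([[4.0, 0.0, 0.0],
              [0.0, 5.0, 0.0]])

def right_inverse(c31, c32):
    """Right inverse of A from Example 3.2 for a given free last row."""
    return np.array([[0.25, 0.0],
                     [0.0, 0.2],
                     [c31, c32]])

C = right_inverse(7.0, -3.0)              # arbitrary sample values
print(np.allclose(A @ C, np.eye(2)))      # AC = I_2 for any c31, c32
print(np.allclose(C @ A, np.eye(3)))      # CA is never I_3

# The right inverse from Fact 3.8 agrees with NumPy's pseudo-inverse.
C_star = A.T @ np.linalg.inv(A @ A.T)
print(np.allclose(C_star, np.linalg.pinv(A)))
```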


3.7 Rank-one Matrices 


All columns of a rank-one matrix are scalar multiples of a single non-zero vector. This fact allows us to represent a rank-one matrix as $\mathbf{u}\mathbf{v}^\top$ with an appropriate choice of $\mathbf{u}$ and $\mathbf{v}$. It is not trivial to check whether the rank of a matrix is one, but it is relatively straightforward to find these two vectors, $\mathbf{u}$ and $\mathbf{v}$, once we know that its rank is one. For instance, it takes effort to tell the rank of the following matrix:

$$\begin{bmatrix} 2 & 1 & 1 \\ 4 & 2 & 2 \\ 8 & 4 & 4 \\ -2 & -1 & -1 \end{bmatrix},$$

but once we know its rank is 1, it is not too challenging to find out that we can express this matrix as

$$\begin{bmatrix} 1 \\ 2 \\ 4 \\ -1 \end{bmatrix} \begin{bmatrix} 2 & 1 & 1 \end{bmatrix}.$$


$\mathbf{u}$ and $\mathbf{v}$ are not uniquely determined, since we can multiply one by a non-zero scalar and the other by its reciprocal without changing the resulting matrix. It is sometimes useful to represent an arbitrary matrix or matrix-matrix product in terms of rank-one matrices. Consider $\mathbf{e}_i = (0,\ldots,0,1,0,\ldots,0)^\top \in \mathbb{R}^n$, which is a vector whose elements are all zeros except for the $i$-th element, which is 1. Consider a matrix $A$ whose $i$-th column is $\mathbf{a}_i$, such that $A = [\mathbf{a}_1 | \mathbf{a}_2 | \cdots | \mathbf{a}_n]$. We can express this matrix as a sum of rank-one matrices, where each summand is $\mathbf{a}_i\mathbf{e}_i^\top$. That is,

$$A = \mathbf{a}_1\mathbf{e}_1^\top + \cdots + \mathbf{a}_n\mathbf{e}_n^\top = \sum_{i=1}^{n} \mathbf{a}_i\mathbf{e}_i^\top.$$


We can generalize this further to show that a product of two matrices can also be expressed as the sum 


of rank-one matrices. 


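Lemma 3.5 Assume an $m \times n$ matrix $A = [\mathbf{a}_1 | \mathbf{a}_2 | \cdots | \mathbf{a}_n]$ for $\mathbf{a}_i \in \mathbb{R}^m$ and an $\ell \times n$ matrix $B = [\mathbf{b}_1 | \mathbf{b}_2 | \cdots | \mathbf{b}_n]$ for $\mathbf{b}_i \in \mathbb{R}^\ell$. Then,

$$AB^\top = \sum_{i=1}^{n} \mathbf{a}_i\mathbf{b}_i^\top. \tag{3.5}$$

Proof: We first rewrite $A$ and $B$ as $A = \sum_i \mathbf{a}_i\mathbf{e}_i^\top$ and $B = \sum_j \mathbf{b}_j\mathbf{e}_j^\top$, respectively. We then notice that $\mathbf{e}_i^\top\mathbf{e}_j = 1$ when $i = j$ and otherwise 0. Then,

$$\begin{aligned}
AB^\top &= \Bigl(\sum_{i=1}^{n} \mathbf{a}_i\mathbf{e}_i^\top\Bigr)\Bigl(\sum_{j=1}^{n} \mathbf{b}_j\mathbf{e}_j^\top\Bigr)^{\!\top} \\
&= \Bigl(\sum_{i=1}^{n} \mathbf{a}_i\mathbf{e}_i^\top\Bigr)\Bigl(\sum_{j=1}^{n} \mathbf{e}_j\mathbf{b}_j^\top\Bigr) \\
&= \sum_{i=1}^{n}\sum_{j=1}^{n} \mathbf{a}_i(\mathbf{e}_i^\top\mathbf{e}_j)\mathbf{b}_j^\top \\
&= \sum_{i=1}^{n} \mathbf{a}_i\mathbf{b}_i^\top.
\end{aligned}$$
∎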
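We generalize it even further by considering a case where a diagonal matrix is inserted between $A$ and $B^\top$:

Corollary 3.1 Assume an $m \times n$ matrix $A = [\mathbf{a}_1 | \mathbf{a}_2 | \cdots | \mathbf{a}_n]$ for $\mathbf{a}_i \in \mathbb{R}^m$, an $\ell \times n$ matrix $B = [\mathbf{b}_1 | \mathbf{b}_2 | \cdots | \mathbf{b}_n]$ for $\mathbf{b}_i \in \mathbb{R}^\ell$, and an $n \times n$ diagonal matrix $\Lambda = \operatorname{diag}(\lambda_1,\ldots,\lambda_n)$. Then,

$$A\Lambda B^\top = \sum_{i=1}^{n} \lambda_i\mathbf{a}_i\mathbf{b}_i^\top. \tag{3.6}$$

Proof: Because $A\Lambda = [\lambda_1\mathbf{a}_1 | \lambda_2\mathbf{a}_2 | \cdots | \lambda_n\mathbf{a}_n]$, we can apply Lemma 3.5 to $(A\Lambda)B^\top$ to prove this statement. ∎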


We will see the usefulness of these sums of rank-one matrices later.
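The identities (3.5) and (3.6) are easy to confirm numerically. The following NumPy sketch uses randomly generated matrices; the dimensions, the seed, and the variable names are assumptions made only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
m, ell, n = 4, 3, 5
A = rng.standard_normal((m, n))      # columns a_1, ..., a_n in R^m
B = rng.standard_normal((ell, n))    # columns b_1, ..., b_n in R^ell
lam = rng.standard_normal(n)         # diagonal entries lambda_1, ..., lambda_n

# Lemma 3.5: A B^T as a sum of n rank-one matrices a_i b_i^T.
outer_sum = sum(np.outer(A[:, i], B[:, i]) for i in range(n))

# Corollary 3.1: A diag(lambda) B^T as a weighted sum of the same matrices.
weighted_sum = sum(lam[i] * np.outer(A[:, i], B[:, i]) for i in range(n))

print(np.allclose(outer_sum, A @ B.T))
print(np.allclose(weighted_sum, A @ np.diag(lam) @ B.T))
```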




3.8 Linear Transformation 


In this section, we consider transformations of vectors between vector spaces that are compatible with vector addition and scalar multiplication. That is, adding two transformed vectors is equivalent to transforming the sum of the two vectors, and multiplying a transformed vector by a scalar is equivalent to transforming the vector scaled by the same scalar. We call such a transformation a linear transformation. For example, the graph of a linear transformation from $\mathbb{R}$ to $\mathbb{R}$ is a line that passes through the origin in the 2-dimensional space, and the graph of a linear transformation from $\mathbb{R}^2$ to $\mathbb{R}$ is a plane that passes through the origin in the 3-dimensional space. We can define a linear transformation more precisely as:


Definition 3.9 We call a function $T : V \to W$ defined between two vector spaces a linear transformation if it satisfies the following properties given two arbitrary vectors $\mathbf{v}_1, \mathbf{v}_2 \in V$ and any scalar $a \in \mathbb{R}$:^a

• $T(\mathbf{v}_1 + \mathbf{v}_2) = T(\mathbf{v}_1) + T(\mathbf{v}_2)$;

• $T(a\mathbf{v}_1) = aT(\mathbf{v}_1)$.

We call such a linear transformation a linear map as well.

^a These two properties can be combined into one condition:

$$T(a\mathbf{v}_1 + \mathbf{v}_2) = aT(\mathbf{v}_1) + T(\mathbf{v}_2)$$

for any scalar $a \in \mathbb{R}$ and a pair of arbitrary vectors $\mathbf{v}_1, \mathbf{v}_2 \in V$.


This definition of linear transformation does not involve a basis of either $V$ or $W$, which brings up the question of whether there exists an efficient way to describe a linear transformation beyond specifying how each and every vector of $V$ maps to a vector in $W$. In order to answer this question and find such an efficient way, we consider finite-dimensional vector spaces in this book. That is, we assume that $\dim(V) = n$ and $\dim(W) = m$.


3.8.1 A Matrix Representation of Linear Transformation 


An arbitrary vector $\mathbf{v}$ in $V$ can be expressed as a linear combination of the basic vectors with a unique set of coefficients, $x_1,\ldots,x_n$, given a basis $\mathcal{B}_V = \{\mathbf{v}_1,\ldots,\mathbf{v}_n\}$.^5 That is, $\mathbf{v} = x_1\mathbf{v}_1 + \cdots + x_n\mathbf{v}_n$. If $T$ is a linear transformation,

$$T(\mathbf{v}) = T(x_1\mathbf{v}_1 + \cdots + x_n\mathbf{v}_n) = x_1T(\mathbf{v}_1) + \cdots + x_nT(\mathbf{v}_n),$$

which implies that we can describe $T$ given the basis $\mathcal{B}_V$ by describing the vector $T(\mathbf{v}_j)$ of $W$ that corresponds to each basic vector $\mathbf{v}_j$ of $V$. In other words, we just need to determine $\{T(\mathbf{v}_j) : j = 1,\ldots,n\}$ and identify $\{x_j : j = 1,\ldots,n\}$ that satisfies $\mathbf{v} = x_1\mathbf{v}_1 + \cdots + x_n\mathbf{v}_n$, in order to map an arbitrary $\mathbf{v} \in V$ to $W$ via $T$. With this information, we can evaluate $T(\mathbf{v})$ by $\sum_{j=1}^{n} x_jT(\mathbf{v}_j)$.

^5 This is because basic vectors are linearly independent and they span the vector space.
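Let $\mathcal{B}_W = \{\mathbf{w}_1,\ldots,\mathbf{w}_m\}$ be a basis of $W$. Because $T(\mathbf{v}_j) \in W$, we can represent each $T(\mathbf{v}_j)$ as a linear combination of the basic vectors $\mathbf{w}_i$'s with coefficients $\{a_{1j},\ldots,a_{mj}\} \subset \mathbb{R}$:

$$T(\mathbf{v}_j) = a_{1j}\mathbf{w}_1 + \cdots + a_{mj}\mathbf{w}_m = \sum_{i=1}^{m} a_{ij}\mathbf{w}_i.$$

Combining the two representations, we can then express $T(\mathbf{v})$, for an arbitrary $\mathbf{v} \in V$, as

$$\begin{aligned}
T(\mathbf{v}) &= x_1T(\mathbf{v}_1) + \cdots + x_nT(\mathbf{v}_n) = \sum_{j=1}^{n} x_jT(\mathbf{v}_j) \\
&= \sum_{j=1}^{n} x_j\Bigl(\sum_{i=1}^{m} a_{ij}\mathbf{w}_i\Bigr) \\
&= \sum_{i=1}^{m}\Bigl(\sum_{j=1}^{n} a_{ij}x_j\Bigr)\mathbf{w}_i.
\end{aligned}$$

By denoting the coordinates of $T(\mathbf{v})$ under $\mathcal{B}_W$ as $(y_1,\ldots,y_m)$, we have purely algebraic equations in real numbers:

$$y_i = \sum_{j=1}^{n} a_{ij}x_j, \quad i = 1,\ldots,m,$$

because $T(\mathbf{v}) = \sum_{i=1}^{m} y_i\mathbf{w}_i$. Putting these all together, we observe that

$$\mathbf{y} = A\mathbf{x},$$

for an $n$-dimensional real vector $\mathbf{x} = (x_1,\ldots,x_n)^\top \in \mathbb{R}^n$ and an $m$-dimensional real vector $\mathbf{y} = (y_1,\ldots,y_m)^\top \in \mathbb{R}^m$, given an $m \times n$ matrix $A = (a_{ij})$. These correspondences are illustrated in the following schematic diagram:

$$\begin{array}{ccc}
V \ni \mathbf{v} & \xrightarrow{\ \text{linear transform } T\ } & \mathbf{w} = T(\mathbf{v}) \in W \\[4pt]
\mathcal{B}_V:\ \mathbf{v} = \sum_{j=1}^{n} x_j\mathbf{v}_j & & \mathcal{B}_W:\ \mathbf{w} = \sum_{i=1}^{m} y_i\mathbf{w}_i \\[4pt]
\mathbb{R}^n \ni \mathbf{x} & \xrightarrow{\ \text{matrix } A\ } & \mathbf{y} = A\mathbf{x} \in \mathbb{R}^m
\end{array}$$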


There are a few interesting aspects we should keep in mind about this relationship:

• $V$ and $W$ are not defined in the context of $\mathbb{R}^n$ and $\mathbb{R}^m$. They are general vector spaces that may consist of any vectors, such as $n$-th order polynomials.

• Other than being linearly independent and spanning the space, there is no further requirement on the vectors forming a basis.

• The coordinates $x_1,\ldots,x_n$ tell us about the relationship between an individual vector and a basis and do not say anything about $T$. Even with the same vector spaces $V$, $W$ and a fixed linear transformation $T$, the transformation matrix $A$ changes if we choose to use a different basis for $V$. Even when we fix the basis of $V$, $A$ changes if we choose another basis for $W$. In short, the matrix that represents $T$ depends on the choice of the bases for the domain and range.

• We are often familiar with the standard bases, $\mathcal{B}_V = \{\mathbf{e}_1,\ldots,\mathbf{e}_n\}$ and $\mathcal{B}_W = \{\mathbf{e}'_1,\ldots,\mathbf{e}'_m\}$, for $V = \mathbb{R}^n$ and $W = \mathbb{R}^m$, respectively. In this case, $T$ is represented by a transformation matrix $A = (a_{ij})$, where $a_{ij} = (T(\mathbf{e}_j))_i = T(\mathbf{e}_j)^\top\mathbf{e}'_i$. This is only a special case and should not be considered a general or a representative case.


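Example 3.3 A set $V$ includes all polynomials of degree 2 or less, that is, $V = \{a_0 + a_1t + a_2t^2 : a_0, a_1, a_2 \in \mathbb{R}\}$. Let $T$ map a polynomial $f(t)$ to its derivative $f'(t)$.

1. It is easy to check that $V$ is a vector space over the scalar set $\mathbb{R}$. Check it yourself.

2. Set $\mathcal{B}_V = \{1, t, t^2\}$. It is clear that $V = \operatorname{span} \mathcal{B}_V$. To see the linear independence, assume $c_0 + c_1t + c_2t^2 = 0$ for all $t$. Because we get $c_0 = c_1 = c_2 = 0$ by sequentially trying 0, 1 and 2 for $t$, $\mathcal{B}_V$ is linearly independent. Therefore, $\mathcal{B}_V$ is a basis for $V$ and $\dim V = |\mathcal{B}_V| = 3$.

3. $T$ is a linear map from $V$ into $V$. That is, $T(a_0 + a_1t + a_2t^2) = a_1 + 2a_2t \in V$. $T$ is linear since differentiation is linear, that is, $(af(t) + bg(t))' = af'(t) + bg'(t)$.

4. Under the basis $\mathcal{B}_V$,

$$a_0 + a_1t + a_2t^2 \in V \longleftrightarrow (a_0, a_1, a_2)^\top \in \mathbb{R}^3.$$

Through $T$, the entries of the matrix representing $T$ are decided as follows:

$$\begin{aligned}
T(1) &= 0 = 0\cdot 1 + 0\cdot t + 0\cdot t^2 &&\Rightarrow\ a_{11} = 0,\ a_{21} = 0,\ a_{31} = 0, \\
T(t) &= 1 = 1\cdot 1 + 0\cdot t + 0\cdot t^2 &&\Rightarrow\ a_{12} = 1,\ a_{22} = 0,\ a_{32} = 0, \\
T(t^2) &= 2t = 0\cdot 1 + 2\cdot t + 0\cdot t^2 &&\Rightarrow\ a_{13} = 0,\ a_{23} = 2,\ a_{33} = 0.
\end{aligned}$$

The $3 \times 3$ matrix $A$ representing $T$ from $\mathcal{B}_V$ into $\mathcal{B}_V$ is given by $A = \begin{bmatrix} 0 & 1 & 0 \\ 0 & 0 & 2 \\ 0 & 0 & 0 \end{bmatrix}$.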


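Example 3.4 A set $V$ includes all polynomials of degree $n$ or less, that is, $V = \{a_0 + a_1t + \cdots + a_nt^n : a_0, a_1, \ldots, a_n \in \mathbb{R}\}$, where $n$ is a fixed integer. Let $T$ map a polynomial $f(t)$ to its derivative $f'(t)$.

1. It is easy to check that $V$ is a vector space over the scalar set $\mathbb{R}$. Try it yourself.

2. Let $\mathcal{B}_V = \{1, t, t^2, \ldots, t^n\}$. It is clear that $V = \operatorname{span} \mathcal{B}_V$. To see the linear independence, assume $a_0 + a_1t + \cdots + a_nt^n = 0$ for all $t$. By plugging in various values for $t$, we get $a_0 = a_1 = \cdots = a_n = 0$, and thus $\mathcal{B}_V$ is linearly independent. Therefore, $\mathcal{B}_V$ is a basis for $V$ and $\dim V = |\mathcal{B}_V| = n + 1$.

3. $T$ can be thought of as a map from $V$ into $V$. If we set $W = \{a_0 + a_1t + \cdots + a_{n-1}t^{n-1} : a_0, a_1, \ldots, a_{n-1} \in \mathbb{R}\}$, $T$ is also a map from $V$ onto $W$. It can also be shown that $T$ is a linear map in both cases. $\mathcal{B}_W = \{1, t, t^2, \ldots, t^{n-1}\}$ is a basis for $W$.

4. For each $k = 0, 1, \ldots, n$,

$$T(t^k) = kt^{k-1} = 0\cdot 1 + 0\cdot t + \cdots + 0\cdot t^{k-2} + k\cdot t^{k-1} + 0\cdot t^k + \cdots + 0\cdot t^n$$

$$\Rightarrow\ a_{1(k+1)} = 0,\ \ldots,\ a_{(k-1)(k+1)} = 0,\ a_{k(k+1)} = k,\ a_{(k+1)(k+1)} = 0,\ \ldots,\ a_{(n+1)(k+1)} = 0.$$

(For $k = 0$, $T(1) = 0$ and the whole first column is zero.) The $(n+1) \times (n+1)$ matrix $A$ representing $T$ from $\mathcal{B}_V$ into $\mathcal{B}_V$ is then

$$A = \begin{bmatrix}
0 & 1 & 0 & \cdots & 0 \\
0 & 0 & 2 & \cdots & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
0 & 0 & 0 & \cdots & n \\
0 & 0 & 0 & \cdots & 0
\end{bmatrix}.$$

The $n \times (n+1)$ matrix $A'$ representing $T$ from $\mathcal{B}_V$ onto $\mathcal{B}_W$ is

$$A' = \begin{bmatrix}
0 & 1 & 0 & \cdots & 0 \\
0 & 0 & 2 & \cdots & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
0 & 0 & 0 & \cdots & n
\end{bmatrix}.$$

Observe that the matrix representations under different bases may be different.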
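The matrices of Examples 3.3 and 3.4 can be generated and applied to coefficient vectors with a few lines of NumPy. In this sketch, the function name `derivative_matrix` and its `square` flag are hypothetical names introduced for illustration.

```python
import numpy as np

def derivative_matrix(n, square=True):
    """Matrix of d/dt on polynomials of degree <= n in the basis {1, t, ..., t^n}.

    square=True  -> (n+1) x (n+1), codomain basis {1, ..., t^n}
    square=False -> n x (n+1),     codomain basis {1, ..., t^(n-1)}
    """
    A = np.zeros((n + 1, n + 1))
    for k in range(1, n + 1):
        A[k - 1, k] = k          # T(t^k) = k t^(k-1)
    return A if square else A[:n, :]

coeffs = np.array([5.0, 3.0, 2.0, 1.0])            # 5 + 3t + 2t^2 + t^3
print(derivative_matrix(3) @ coeffs)                # coordinates of 3 + 4t + 3t^2 in {1, t, t^2, t^3}
print(derivative_matrix(3, square=False) @ coeffs)  # same polynomial in {1, t, t^2}
print(derivative_matrix(2))                         # the 3 x 3 matrix of Example 3.3
```

Both matrices describe the same map $T$. Only the basis used for the codomain differs, which is why one has an extra zero row.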


Composition of Linear Transformations 


Let $U$, $V$ and $W$ be vector spaces of dimensions $n$, $m$ and $\ell$, respectively. We consider two linear transformations/maps, $S : U \to V$ and $T : V \to W$. Furthermore, let $A = (a_{kj})$ and $B = (b_{ik})$ be the transformation matrices representing $S$ and $T$, respectively, with respect to three bases $\mathcal{B}_U = \{\mathbf{u}_1,\ldots,\mathbf{u}_n\}$, $\mathcal{B}_V = \{\mathbf{v}_1,\ldots,\mathbf{v}_m\}$, and $\mathcal{B}_W = \{\mathbf{w}_1,\ldots,\mathbf{w}_\ell\}$. We now derive an $\ell \times n$ matrix $C$ that represents the composition of $S$ and $T$, $T \circ S : U \to W$, just like what we did before.


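We can express $S(\mathbf{u}_j)$ for a basic vector $\mathbf{u}_j$ of $U$ as a linear combination of basic vectors of $V$ with appropriate $a_{kj}$'s,

$$S(\mathbf{u}_j) = \sum_{k=1}^{m} a_{kj}\mathbf{v}_k.$$

We can similarly write $T(\mathbf{v}_k)$ for a basic vector $\mathbf{v}_k$ of $V$ as a linear combination of basic vectors of $W$ as

$$T(\mathbf{v}_k) = \sum_{i=1}^{\ell} b_{ik}\mathbf{w}_i.$$

We then compose these two, as follows:

$$(T \circ S)(\mathbf{u}_j) = T\Bigl(\sum_{k=1}^{m} a_{kj}\mathbf{v}_k\Bigr) = \sum_{k=1}^{m} a_{kj}T(\mathbf{v}_k) = \sum_{k=1}^{m} a_{kj}\sum_{i=1}^{\ell} b_{ik}\mathbf{w}_i.$$

This can be rewritten as

$$(T \circ S)(\mathbf{u}_j) = \sum_{i=1}^{\ell} c_{ij}\mathbf{w}_i,$$

where

$$c_{ij} = \sum_{k=1}^{m} b_{ik}a_{kj}.$$

That is, $C = BA$.
The following theorem summarizes this result.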
The following theorem summarizes this result. 


Theorem 3.3 Let $A$ and $B$ be the matrix representations of linear transformations $S$ and $T$, respectively, with respect to some bases. Then $T \circ S$ is a linear transformation, and its matrix representation with respect to the same bases is $BA$.
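In coordinates, Theorem 3.3 says that applying $A$ and then $B$ gives the same result as applying the single matrix $BA$. A minimal NumPy sketch, with randomly chosen matrices and dimensions that are assumptions for this illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
n, m, ell = 4, 3, 2
A = rng.standard_normal((m, n))     # represents S : U -> V
B = rng.standard_normal((ell, m))   # represents T : V -> W
x = rng.standard_normal(n)          # coordinates of some u in U

step_by_step = B @ (A @ x)          # coordinates of T(S(u))
C = B @ A                           # ell x n matrix of the composition
print(np.allclose(C @ x, step_by_step))
```

Note the order: the matrix of the map applied first stands on the right.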


3.8.2 Interpretable Linear Transformations 
There are some cases where the transformation matrix $A$, in $T(\mathbf{x}) = A\mathbf{x}$, is intuitively interpretable.

• A Scaling Matrix: A matrix $A$ of the form $aI = \begin{bmatrix} a & 0 \\ 0 & a \end{bmatrix}$ multiplies each element of $\mathbf{x}$ by a scalar $a$, as $T(\mathbf{x}) = aI\mathbf{x} = a\mathbf{x}$.


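• A Rotation Matrix: Take as an example a $2 \times 2$ matrix $A = \begin{bmatrix} 0 & -1 \\ 1 & 0 \end{bmatrix}$ and $T(\mathbf{x}) = A\mathbf{x} = \begin{bmatrix} -x_2 \\ x_1 \end{bmatrix}$, which rotates $\mathbf{x}$ counterclockwise by 90°. We can generalize such a matrix as

$$R_\theta = \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix}$$

such that $R_\theta\mathbf{x}$ rotates $\mathbf{x}$ by $\theta$. Using Theorem 3.3, we see that $R_\theta R_\phi$ corresponds to rotating a vector by $\phi$ and then by $\theta$, which is equivalent to rotating the same vector by $\theta + \phi$. From this, we see that

$$R_\theta R_\phi = R_{\theta+\phi}.$$

From this relationship, we can further derive that $R_\theta^{-1} = R_{-\theta}$, because $R_\theta R_{-\theta} = I$. We can also easily check that $R_{-\theta} = R_\theta^\top$.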


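• A Projection Matrix: Consider $P = \begin{bmatrix} 1 & 0 \\ 0 & 0 \end{bmatrix}$, which projects a vector $\mathbf{x}$ onto the $x$-axis, as $T(\mathbf{x}) = P\mathbf{x} = \begin{bmatrix} x_1 \\ 0 \end{bmatrix}$. We can generalize such a matrix to perform projection of a vector onto a line that passes through the origin at the angle $\theta$. We use $P_\theta$ to refer to the transformation matrix corresponding to this particular linear map $T$. The two basic vectors in $\mathbb{R}^2$, $\mathbf{e}_1 = (1,0)^\top$ and $\mathbf{e}_2 = (0,1)^\top$, are mapped to $T(\mathbf{e}_1) = (\cos\theta\cos\theta, \sin\theta\cos\theta)^\top$ and $T(\mathbf{e}_2) = (\cos\theta\sin\theta, \sin\theta\sin\theta)^\top$, respectively. We can then get

$$P_\theta = \begin{bmatrix} \cos^2\theta & \sin\theta\cos\theta \\ \sin\theta\cos\theta & \sin^2\theta \end{bmatrix}$$

because

$$\begin{aligned}
T(\mathbf{x}) &= x_1T(\mathbf{e}_1) + x_2T(\mathbf{e}_2) \\
&= x_1\begin{bmatrix} \cos\theta\cos\theta \\ \sin\theta\cos\theta \end{bmatrix} + x_2\begin{bmatrix} \cos\theta\sin\theta \\ \sin\theta\sin\theta \end{bmatrix} \\
&= \begin{bmatrix} \cos^2\theta & \sin\theta\cos\theta \\ \sin\theta\cos\theta & \sin^2\theta \end{bmatrix}\begin{bmatrix} x_1 \\ x_2 \end{bmatrix}.
\end{aligned}$$

Since a projected vector should remain as it is even if it is projected once more, $P^2 = P$ must hold for any projection matrix $P$, and indeed $P_\theta$ above satisfies this condition. Another interesting property of projection is that $I - P$ is also a projection matrix if $P$ is.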


• A Reflection Matrix: We can reflect a vector $\mathbf{x}$ to the other side of the line that passes through the origin at the angle of 45° by multiplying it with $A = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}$, as $T(\mathbf{x}) = A\mathbf{x} = \begin{bmatrix} x_2 \\ x_1 \end{bmatrix}$.

Let us generalize this to work with any given angle $\theta$ by defining $H_\theta$. We first realize that the mid-point between $\mathbf{x}$ and $T(\mathbf{x})$ must be identical to $\mathbf{x}$ projected onto the reflecting line. That is, $\frac{1}{2}(\mathbf{x} + T(\mathbf{x})) = P_\theta\mathbf{x}$. From this, we can derive $H_\theta$ by $H_\theta = 2P_\theta - I$, since $T(\mathbf{x}) = 2P_\theta\mathbf{x} - \mathbf{x} = (2P_\theta - I)\mathbf{x}$.

Even more generally, consider a reflection matrix for an arbitrary subspace of $\mathbb{R}^n$. Given an $n \times n$ projection matrix $P$ that projects a vector onto this subspace (i.e., it is symmetric and satisfies $P^2 = P$), we can construct a reflection matrix $H$ by $H = 2P - I$. This matrix satisfies $H^\top = H$ and $H^2 = (2P - I)^2 = 4P^2 - 4P + I = I$. Conversely, given an $n \times n$ matrix $H$ that satisfies $H^\top = H$ and $H^2 = I$, we can obtain the projection matrix $P$ by $P = \frac{1}{2}(H + I)$. This matrix $P$ then satisfies $P^\top = P$ and $P^2 = \frac{1}{4}(H^2 + 2H + I) = \frac{1}{4}(2H + 2I) = P$. Because $I - P$ is a projection matrix when $P$ is a projection matrix, $2(I - P) - I = I - 2P$ is also a reflection matrix.
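The properties of rotation, projection and reflection matrices listed above can be verified numerically. The sketch below is a NumPy illustration; the function names and the test angles are assumptions introduced here.

```python
import numpy as np

def rotation(theta):
    """Counterclockwise rotation by theta in the plane."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

def projection(theta):
    """Projection onto the line through the origin at angle theta."""
    u = np.array([np.cos(theta), np.sin(theta)])
    return np.outer(u, u)

def reflection(theta):
    """Reflection across the same line: H = 2P - I."""
    return 2.0 * projection(theta) - np.eye(2)

a, b = 0.7, -1.9   # arbitrary test angles
print(np.allclose(rotation(a) @ rotation(b), rotation(a + b)))
print(np.allclose(rotation(-a), rotation(a).T))
print(np.allclose(projection(a) @ projection(a), projection(a)))
print(np.allclose(reflection(a) @ reflection(a), np.eye(2)))
print(np.round(reflection(np.pi / 4), 12))   # swaps the two coordinates
```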


3.9 Application to Graph Theory 


We discussed the incidence matrix associated with a directed graph in Section 2.8. We investigate various aspects of this incidence matrix. Let us consider the following simple directed graph.

[Figure: a directed graph with four nodes and five arcs $e_1, \ldots, e_5$]

The incidence matrix $A$ of this graph is a $5 \times 4$ matrix whose $i$-th row corresponds to the arc $e_i$ and contains exactly one $-1$ and one $1$, with all other entries zero. Denote $\mathbf{1} = (1,1,1,1)^\top$ and the $i$-th row of $A$ by $\mathbf{r}_i$. Then, we observe the following.


1. $A\mathbf{1} = \mathbf{0}$, that is, $\mathbf{1} \in \operatorname{Null}(A)$. This property holds for all incidence matrices of directed graphs since each of their rows contains exactly one $1$ and one $-1$, representing a directed arc.

2. The dependence relation $\mathbf{r}_1 - \mathbf{r}_2 + \mathbf{r}_3 = \mathbf{0}$ implies that $\{\mathbf{r}_1, \mathbf{r}_2, \mathbf{r}_3\}$ is linearly dependent and $(1, -1, 1, 0, 0)^\top \in \operatorname{Null}(A^\top)$. If we discard the direction of arcs in the graph, the arcs $e_1, e_2, e_3$ corresponding to rows $\mathbf{r}_1, \mathbf{r}_2, \mathbf{r}_3$, respectively, constitute a cycle or loop. The correspondence between a cycle in a graph and a dependence relation among row vectors of its incidence matrix holds for all graphs.

3. If any two nodes are connected by arcs, discarding their directions, the graph is called a connected graph. If the graph has no cycle, the graph is called acyclic. If a graph has both properties above, that is, a graph is acyclic and connected, we call the graph a tree. It can be shown that the rows of the incidence matrix corresponding to arcs in a tree are linearly independent. Furthermore, it can also be shown by mathematical induction that a tree with $n$ nodes has $n - 1$ arcs.

4. $\operatorname{rank} A = n - 1$ for the incidence matrix of a connected directed graph with $n$ nodes. First, we get $\operatorname{rank} A \le n - 1$ from $A\mathbf{1} = \mathbf{0}$. The row vectors corresponding to the $n - 1$ arcs constituting a tree^6 are linearly independent, and hence $\operatorname{rank} A \ge n - 1$. A connected graph always contains at least one such tree.

5. $\operatorname{Null}(A) = \operatorname{span}\{\mathbf{1}\}$ since $A\mathbf{1} = \mathbf{0}$ and $\dim \operatorname{Null}(A) = 1$ from $\operatorname{rank} A = n - 1$.
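These observations can be checked on a small example. The graph in the sketch below is a hypothetical one chosen for illustration (it is not claimed to be the graph drawn in the text), and the sign convention, $-1$ where an arc leaves and $+1$ where it enters, is likewise an assumption.

```python
import numpy as np

# Hypothetical directed graph on 4 nodes (0..3) with 5 arcs given as (tail, head).
arcs = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3)]
n = 4

A = np.zeros((len(arcs), n))
for row, (tail, head) in enumerate(arcs):
    A[row, tail] = -1.0   # assumed convention: -1 where the arc leaves
    A[row, head] = 1.0    # and +1 where it enters

ones = np.ones(n)
cycle = np.array([1.0, -1.0, 1.0, 0.0, 0.0])  # first three arcs form an undirected cycle

print(A @ ones)                    # zero vector: 1 lies in Null(A)
print(cycle @ A)                   # zero vector: first row - second row + third row = 0
print(np.linalg.matrix_rank(A))    # n - 1 for this connected graph
```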


3.10 Application to Data Science: Neural Networks 


In machine learning, a neural network is often implemented as alternation between linear and nonlinear 
transformations, and linear transformations are often expressed as weight matrices. Training such a neural 
network corresponds to adjusting these linear transformations so as to minimize the difference between 
observations and neural network’s predictions. When visualizing a layered neural network as a directed 
graph, or a network, we use a node to represent a nonlinear transformation and arcs correspond to linear 
transformations. For instance, we can visualize a neural network that takes a 2-dimensional input $\mathbf{x} = (x_1, x_2)^\top \in \mathbb{R}^2$, as below.

Computation happens left-to-right in this neural network. The input $(x_1, x_2)$ is linearly and nonlinearly transformed into the first intermediate quantities $(y_1, y_2, y_3)$. These are linearly and nonlinearly transformed into the next intermediate quantities $(z_1, z_2)$. The final two are linearly transformed once more to form the final output $t$. It is visibly apparent from the figure above how this neural network is layered; $(x_1, x_2)$ is the input layer, $(y_1, y_2, y_3)$ and $(z_1, z_2)$ are two hidden layers, and $(t)$ is the output layer. Such a layered structure enables efficient computation even with a neural network with many nodes, such as by using a general-purpose graphics processing unit (GPU).

Consider linear transformations within this neural network. The first linear transformation from $(x_1, x_2)$ to $(y_1, y_2, y_3)$ can be expressed as a $3 \times 2$ matrix, according to Section 3.8.1, since it is a linear transformation from a 2-dimensional space to a 3-dimensional space. Let this matrix be $W^{(1)} = (w^{(1)}_{ij})$. The next linear transformation, from a 3-dimensional space to a 2-dimensional space, can be expressed as a $2 \times 3$ matrix $W^{(2)} = (w^{(2)}_{ij})$, and the final one as a $1 \times 2$ matrix $W^{(3)} = [\,w^{(3)}_1 \ \ w^{(3)}_2\,]$. Each arc in the directed graph is associated with one of the elements of these transformation matrices. We can visualize this by putting the associated matrix element, to which we often refer as a parameter, on top of the corresponding arc, as below.

^6 A tree in a graph is an acyclic connected subgraph. It is well-known that every tree has $n - 1$ arcs if the graph has $n$ nodes.


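We use $\sigma^{(k)}$ to denote the $k$-th nonlinear transformation. Such a nonlinear transformation is often called an activation function, and it is often applied point-wise, that is, it is applied to each node in the layer independently. Let us use a hat to denote the value of each node prior to the nonlinear transformation, such as $\hat{y}_i$, $\hat{z}_i$ and $\hat{t}$. In this particular example, these pre-activation values are computed by

$$\hat{y}_1 = w^{(1)}_{11}x_1 + w^{(1)}_{12}x_2, \quad \hat{y}_2 = w^{(1)}_{21}x_1 + w^{(1)}_{22}x_2, \quad \hat{y}_3 = w^{(1)}_{31}x_1 + w^{(1)}_{32}x_2.$$

We apply the activation function to these pre-activation values to get the final values of the first layer: $y_1 = \sigma^{(1)}(\hat{y}_1)$, $y_2 = \sigma^{(1)}(\hat{y}_2)$, $y_3 = \sigma^{(1)}(\hat{y}_3)$. It is a common practice to apply the activation function to a vector, which is equivalent to applying the activation function to each element of the vector. This simplifies the equations above into

$$\hat{\mathbf{y}} = W^{(1)}\mathbf{x} = \begin{bmatrix} w^{(1)}_{11} & w^{(1)}_{12} \\ w^{(1)}_{21} & w^{(1)}_{22} \\ w^{(1)}_{31} & w^{(1)}_{32} \end{bmatrix}\begin{bmatrix} x_1 \\ x_2 \end{bmatrix}, \quad \mathbf{y} = \sigma^{(1)}(\hat{\mathbf{y}}).$$

The same procedure applies equally to the next layer: $\hat{z}_1 = w^{(2)}_{11}y_1 + w^{(2)}_{12}y_2 + w^{(2)}_{13}y_3$, $\hat{z}_2 = w^{(2)}_{21}y_1 + w^{(2)}_{22}y_2 + w^{(2)}_{23}y_3$ and $z_1 = \sigma^{(2)}(\hat{z}_1)$, $z_2 = \sigma^{(2)}(\hat{z}_2)$. This is simplified into

$$\hat{\mathbf{z}} = W^{(2)}\mathbf{y} = \begin{bmatrix} w^{(2)}_{11} & w^{(2)}_{12} & w^{(2)}_{13} \\ w^{(2)}_{21} & w^{(2)}_{22} & w^{(2)}_{23} \end{bmatrix}\begin{bmatrix} y_1 \\ y_2 \\ y_3 \end{bmatrix}, \quad \mathbf{z} = \sigma^{(2)}(\hat{\mathbf{z}}).$$

In the final layer, we first compute $\hat{t} = w^{(3)}_1z_1 + w^{(3)}_2z_2$ and apply the activation function to get $t = \sigma^{(3)}(\hat{t})$. This is equivalent to

$$\hat{t} = W^{(3)}\mathbf{z} = \begin{bmatrix} w^{(3)}_1 & w^{(3)}_2 \end{bmatrix}\begin{bmatrix} z_1 \\ z_2 \end{bmatrix}, \quad t = \sigma^{(3)}(\hat{t}).$$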


In summary, this neural network performs the following computation:^7

$$t = \sigma^{(3)}\bigl(W^{(3)}\sigma^{(2)}\bigl(W^{(2)}\sigma^{(1)}\bigl(W^{(1)}\mathbf{x}\bigr)\bigr)\bigr).$$

Learning corresponds to adjusting $w^{(k)}_{ij}$ to minimize the difference between the output from the neural network and a desired (target) output. Modern neural networks sometimes have billions or even hundreds of billions of parameters.

Earlier, it was usual to use a so-called sigmoid function $(1 + e^{-x})^{-1}$, which is a bounded S-shaped curve. This choice, however, made learning very challenging. It has become more common in recent years to use a rectified linear function $\max\{x, 0\}$ instead, which is considered one of the major reasons behind the explosive growth of deep learning since 2012.
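The layered computation described in this section can be sketched in a few lines of NumPy. The weights below are random placeholders, and using the rectified linear function for all three activations is an assumption made for this illustration. Biases are omitted, as in the summary formula.

```python
import numpy as np

def relu(v):
    """Rectified linear function max{x, 0}, applied element-wise."""
    return np.maximum(v, 0.0)

rng = np.random.default_rng(3)
W1 = rng.standard_normal((3, 2))   # input (2) -> first hidden layer (3)
W2 = rng.standard_normal((2, 3))   # first hidden (3) -> second hidden (2)
W3 = rng.standard_normal((1, 2))   # second hidden (2) -> output (1)

def forward(x, act=relu):
    """Alternate linear maps and point-wise activations, layer by layer."""
    y = act(W1 @ x)
    z = act(W2 @ y)
    t = act(W3 @ z)
    return t

x = np.array([1.0, -2.0])
t = forward(x)
print(t.shape)   # a single output value stored in an array of shape (1,)
```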


3.10.1 Flexibility of Neural Network Representations 


An interesting and consequential question we can ask is how nonlinear a neural network can be when it only implements linear combinations followed by such a simple activation function. Consider a neural network $f(x, y; \Theta)$ in Figure 3.1.




Figure 3.1: A simple neural network with 2 hidden layers and 121 learning parameters 


The output $f(x, y; \Theta)$ is determined by two hidden layers of sizes 16 and 4, respectively, given an input $(x, y)$. These 20 nodes in the hidden layers use rectified linear functions as their activation functions, and we use a sigmoid function at the output layer. There are 100 arcs ($2 \cdot 16 + 16 \cdot 4 + 4 \cdot 1$) and 21 nodes, excluding the input nodes, resulting in 100 weights and 21 biases. Let $\Theta \in \mathbb{R}^{121}$ be the collection of these parameters. As we alter $\Theta$, the output of the neural network given the same input $(x, y)$ changes. In other words, $\Theta$ determines the function expressed by the neural network.
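The parameter count can be reproduced directly from the layer sizes. The sketch below also evaluates one such network for a random parameter vector. How the 121 parameters are ordered inside `theta` is an assumed layout chosen for this illustration, not one specified in the text.

```python
import numpy as np

layer_sizes = [2, 16, 4, 1]  # input (x, y), two hidden layers, one output

# One weight per arc between consecutive layers, one bias per non-input node.
num_weights = sum(a * b for a, b in zip(layer_sizes[:-1], layer_sizes[1:]))
num_biases = sum(layer_sizes[1:])
num_params = num_weights + num_biases

def f(x, y, theta):
    """Evaluate the network; the parameter ordering inside theta is an assumed layout."""
    h = np.array([x, y], dtype=float)
    offset = 0
    last = len(layer_sizes) - 2
    for k, (a, b) in enumerate(zip(layer_sizes[:-1], layer_sizes[1:])):
        W = theta[offset:offset + a * b].reshape(b, a)
        offset += a * b
        bias = theta[offset:offset + b]
        offset += b
        pre = W @ h + bias
        # ReLU in the hidden layers, sigmoid at the output layer.
        h = 1.0 / (1.0 + np.exp(-pre)) if k == last else np.maximum(pre, 0.0)
    return float(h[0])

theta = np.random.default_rng(0).standard_normal(num_params)
print(num_weights, num_biases, num_params)  # 100 21 121
print(f(0.5, -1.0, theta))
```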


^7 For brevity, we omit an extra scalar added to the pre-activation value of each node, called a bias.




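We obtained four functions, respectively corresponding to $\Theta_1, \ldots, \Theta_4$, by solving the following minimization problem using four real data sets:

$$\min_{\Theta} \sum_{i=1}^{n} \bigl\| f(x_i, y_i; \Theta) - (\text{$i$-th observed value}) \bigr\|.$$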

In Figure 3.2, we plot these four functions represented by four parameter sets. The diversity and com- 


plexity of these functions demonstrate that we can get vastly different, highly nonlinear functions by 


simply varying the parameters of the same neural network. 


Figure 3.2: Various output representations by a simple neural network (panels: (b) Cliff shape, (c) Splash shape, (d) Bridge shape)


Chapter 4 


Orthogonality and Projections 


How can we quantify the geometric relationship of two vectors in a vector space? Unlike on a plane, 
there is no left side of a vector nor clockwise direction at the origin in the Euclidean spaces of dimension 
higher than 2. So, we do not care about the order of two compared vectors, and the desired quantity 
must be symmetric in two vector arguments. Also, carrying over the universal linearity in linear algebra, 
the quantity needs to be linear in each of the two compared vectors, which is called bilinearity. When both arguments are the same non-zero vector, we require the quantity to be positive and define it as the squared norm of the vector. You may regard the norm as the length of a vector. We call this quantity an inner product. Because we can define many inner products in a finite-dimensional vector space, as will be shown later, the value of an inner product between two vectors has no absolute meaning except when it is zero. With a norm induced from any inner product, two vectors satisfy the Pythagorean relation once their inner product vanishes. We say that two vectors are orthogonal in this case. The
orthogonality extends naturally between a vector and a subspace as well as between two subspaces. 
Along this orthogonal direction, we project a vector onto a subspace. The subspace can be spanned by a 
vector, orthogonal vectors, or independent vectors, for each of which we have a projection representation 
in terms of the inner product. If a basis consists of basic vectors orthogonal to each other, the coefficients 
of a vector in its linear combination representation work as the coordinates in the Euclidean spaces, 
simplifying many computations involving inner products. Hence, it is important to get orthogonal basic 
vectors. Fortunately, there is a systematic way to obtain orthogonal basic vectors incrementally called 
the Gram-Schmidt process. The Gram-Schmidt process is extremely helpful and applied throughout the 
rest of this book. 


4.1 Inner Products 


We start by introducing the inner product between two vectors and the norm of a vector. 


Kang and Cho, Linear Algebra for Data Science, 59 
©2024. (Wanmo Kang, Kyunghyun Cho) all rights reserved. 




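Definition 4.1 An inner product $\langle \mathbf{v}_1, \mathbf{v}_2 \rangle$ between two vectors, $\mathbf{v}_1$ and $\mathbf{v}_2$, in a vector space $V$ is a real-valued function that satisfies the following properties:
1. $\langle \mathbf{v}_1, \mathbf{v}_2 \rangle = \langle \mathbf{v}_2, \mathbf{v}_1 \rangle$;
2. $\langle c\mathbf{v}_1, \mathbf{v}_2 \rangle = c\langle \mathbf{v}_1, \mathbf{v}_2 \rangle$ for any real number $c$;
3. $\langle \mathbf{v}_1 + \mathbf{v}_2, \mathbf{v}_3 \rangle = \langle \mathbf{v}_1, \mathbf{v}_3 \rangle + \langle \mathbf{v}_2, \mathbf{v}_3 \rangle$ for any $\mathbf{v}_3 \in V$;
4. $\langle \mathbf{v}_1, \mathbf{v}_1 \rangle > 0$ if and only if $\mathbf{v}_1 \ne \mathbf{0}$.

When an inner product is defined, we use it to define a norm induced from the inner product as

$$|\mathbf{v}| = \sqrt{\langle \mathbf{v}, \mathbf{v} \rangle} \quad \text{for each } \mathbf{v} \in V,$$

just like the absolute value of a real number. If the norm of a vector $\mathbf{v}$ is 1, that is, $\langle \mathbf{v}, \mathbf{v} \rangle = 1$, we call $\mathbf{v}$ a unit vector.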


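From the second property of an inner product in Definition 4.1, we see $\langle \mathbf{0}, \mathbf{v}_1 \rangle = 0$. $\langle \mathbf{v}_1, \mathbf{v}_2 + \mathbf{v}_3 \rangle = \langle \mathbf{v}_1, \mathbf{v}_2 \rangle + \langle \mathbf{v}_1, \mathbf{v}_3 \rangle$ can be shown if we combine the first and third properties. Because $a\langle \mathbf{v}_1, \mathbf{v}_2 \rangle$ is an inner product for any positive scalar $a$, we can easily see that there are infinitely many different inner products if we have at least one. Similarly, the sum of two inner products $\langle \cdot, \cdot \rangle_1$ and $\langle \cdot, \cdot \rangle_2$, $\langle \mathbf{v}_1, \mathbf{v}_2 \rangle_1 + \langle \mathbf{v}_1, \mathbf{v}_2 \rangle_2$, is also an inner product. We study how to characterize all possible inner products in Theorem 4.1.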

From the definition, we see that there might be many norms on a single vector space since many inner products exist. Generally, a norm of a vector in a vector space $V$ is a real-valued function $f$ that satisfies the following three properties:

• $f(\mathbf{v}) \ge 0$ for all $\mathbf{v} \in V$, and $f(\mathbf{v}) = 0$ if and only if $\mathbf{v} = \mathbf{0}$;

• Scalar multiplication: $f(c\mathbf{v}) = |c| f(\mathbf{v})$ for all $c \in \mathbb{R}$ and $\mathbf{v} \in V$;

• Triangular inequality: $f(\mathbf{v} + \mathbf{w}) \le f(\mathbf{v}) + f(\mathbf{w})$ for all $\mathbf{v}, \mathbf{w} \in V$.

Together with Fact 4.2 below, we can show that the norm defined using an inner product satisfies the above three properties, and hence it is truly a norm.

Let us study an inner product further first by showing that the Cauchy-Schwarz inequality holds in a 


vector space as well. 


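Fact 4.1 The Cauchy-Schwarz inequality holds between an inner product and the norm induced by the inner product:

$$|\langle \mathbf{v}_1, \mathbf{v}_2 \rangle| \le |\mathbf{v}_1|\,|\mathbf{v}_2|.$$

Proof: For any real number $t \in \mathbb{R}$,

$$\begin{aligned}
|t\mathbf{v}_1 + \mathbf{v}_2|^2 &= \langle t\mathbf{v}_1 + \mathbf{v}_2, t\mathbf{v}_1 + \mathbf{v}_2 \rangle \\
&= \langle t\mathbf{v}_1 + \mathbf{v}_2, t\mathbf{v}_1 \rangle + \langle t\mathbf{v}_1 + \mathbf{v}_2, \mathbf{v}_2 \rangle \\
&= \langle t\mathbf{v}_1, t\mathbf{v}_1 \rangle + \langle \mathbf{v}_2, t\mathbf{v}_1 \rangle + \langle t\mathbf{v}_1, \mathbf{v}_2 \rangle + \langle \mathbf{v}_2, \mathbf{v}_2 \rangle \\
&= t^2\langle \mathbf{v}_1, \mathbf{v}_1 \rangle + t\langle \mathbf{v}_2, \mathbf{v}_1 \rangle + t\langle \mathbf{v}_1, \mathbf{v}_2 \rangle + \langle \mathbf{v}_2, \mathbf{v}_2 \rangle \\
&= t^2\langle \mathbf{v}_1, \mathbf{v}_1 \rangle + 2t\langle \mathbf{v}_1, \mathbf{v}_2 \rangle + \langle \mathbf{v}_2, \mathbf{v}_2 \rangle.
\end{aligned}$$

Since $|t\mathbf{v}_1 + \mathbf{v}_2|^2 \ge 0$ for any $t$, the quadratic equation in $t$ obtained by setting the right-hand side to zero cannot have two different solutions. That is, its discriminant is not positive:

$$0 \ge \langle \mathbf{v}_1, \mathbf{v}_2 \rangle^2 - \langle \mathbf{v}_1, \mathbf{v}_1 \rangle\langle \mathbf{v}_2, \mathbf{v}_2 \rangle = \langle \mathbf{v}_1, \mathbf{v}_2 \rangle^2 - |\mathbf{v}_1|^2 \cdot |\mathbf{v}_2|^2.$$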


From the Cauchy-Schwarz inequality, we can derive the triangular inequality with which we find the 


definition of the norm from an inner product much more natural. 
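Fact 4.2 The triangular inequality holds for the norm induced from an inner product $\langle \cdot, \cdot \rangle$:

$$|\mathbf{v}_1 + \mathbf{v}_2| \le |\mathbf{v}_1| + |\mathbf{v}_2|,$$

and furthermore the positive homogeneity holds:

$$|c\mathbf{v}| = c|\mathbf{v}| \quad \text{for any positive real number } c.$$

Proof:

$$\begin{aligned}
|\mathbf{v}_1 + \mathbf{v}_2|^2 &= \langle \mathbf{v}_1 + \mathbf{v}_2, \mathbf{v}_1 + \mathbf{v}_2 \rangle \\
&= \langle \mathbf{v}_1 + \mathbf{v}_2, \mathbf{v}_1 \rangle + \langle \mathbf{v}_1 + \mathbf{v}_2, \mathbf{v}_2 \rangle \\
&= \langle \mathbf{v}_1, \mathbf{v}_1 \rangle + \langle \mathbf{v}_2, \mathbf{v}_1 \rangle + \langle \mathbf{v}_1, \mathbf{v}_2 \rangle + \langle \mathbf{v}_2, \mathbf{v}_2 \rangle \\
&= \langle \mathbf{v}_1, \mathbf{v}_1 \rangle + 2\langle \mathbf{v}_1, \mathbf{v}_2 \rangle + \langle \mathbf{v}_2, \mathbf{v}_2 \rangle \\
&\le |\mathbf{v}_1|^2 + 2|\mathbf{v}_1| \cdot |\mathbf{v}_2| + |\mathbf{v}_2|^2 \\
&= (|\mathbf{v}_1| + |\mathbf{v}_2|)^2,
\end{aligned}$$

where the inequality follows from the Cauchy-Schwarz inequality. For the positive homogeneity,

$$|c\mathbf{v}| = \sqrt{\langle c\mathbf{v}, c\mathbf{v} \rangle} = \sqrt{c^2\langle \mathbf{v}, \mathbf{v} \rangle} = c\sqrt{\langle \mathbf{v}, \mathbf{v} \rangle}.$$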
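Both inequalities can be spot-checked numerically for an inner product other than the usual dot product. The sketch below uses a weighted sum $\sum_i w_i x_i y_i$ with positive weights on $\mathbb{R}^4$. The weights, the sample count, and the small tolerance for floating-point error are assumptions made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
w = np.array([1.0, 2.0, 0.5, 3.0])   # positive weights (an assumed example)

def inner(x, y):
    """Weighted inner product on R^4: sum_i w_i x_i y_i."""
    return float(np.sum(w * x * y))

def norm(x):
    """Norm induced by the inner product."""
    return np.sqrt(inner(x, x))

ok_cauchy_schwarz = True
ok_triangle = True
for _ in range(1000):
    v1, v2 = rng.standard_normal(4), rng.standard_normal(4)
    ok_cauchy_schwarz = ok_cauchy_schwarz and abs(inner(v1, v2)) <= norm(v1) * norm(v2) + 1e-12
    ok_triangle = ok_triangle and norm(v1 + v2) <= norm(v1) + norm(v2) + 1e-12

print(ok_cauchy_schwarz, ok_triangle)
```

A finite random sample cannot prove the inequalities, of course. It only illustrates what the two facts assert.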


As the first example of an inner product, consider the following vector space consisting of polynomial 


functions. 


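Example 4.1 Let $V$ be a vector space of polynomials of degree less than or equal to $n$. For two polynomials $f(t), g(t) \in V$, we define an inner product

$$\langle f, g \rangle = \int_{-1}^{1} f(t)g(t)\,dt.$$

$\int_{-1}^{1} f(t)g(t)\,dt = \int_{-1}^{1} g(t)f(t)\,dt$ is linear in $f$ and $g$. $\int_{-1}^{1} f(t)^2\,dt = 0$ implies $f = 0$ since $f$ is a polynomial, which is continuous. Hence, $\langle \cdot, \cdot \rangle$ is an inner product.

For $f(t) = t$ and $g(t) = t^2$,

$$\langle f, g \rangle = \int_{-1}^{1} t \cdot t^2\,dt = \int_{-1}^{1} t^3\,dt = 0,$$

$$|f| = \sqrt{\langle f, f \rangle} = \sqrt{\int_{-1}^{1} t^2\,dt} = \sqrt{\left[\tfrac{1}{3}t^3\right]_{-1}^{1}} = \sqrt{\tfrac{2}{3}},$$

$$|g| = \sqrt{\langle g, g \rangle} = \sqrt{\int_{-1}^{1} t^4\,dt} = \sqrt{\left[\tfrac{1}{5}t^5\right]_{-1}^{1}} = \sqrt{\tfrac{2}{5}},$$

$$|f - g| = \sqrt{|f|^2 - 2\langle f, g \rangle + |g|^2} = \sqrt{\tfrac{2}{3} - 2 \cdot 0 + \tfrac{2}{5}} = \frac{4}{\sqrt{15}}.$$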
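The values in Example 4.1 can be reproduced with exact polynomial integration. This sketch relies on NumPy's `numpy.polynomial.Polynomial` class; the helper names `inner` and `norm` are introduced here for illustration.

```python
import numpy as np
from numpy.polynomial import Polynomial

def inner(f, g):
    """Inner product <f, g> = integral of f(t) g(t) over [-1, 1]."""
    antiderivative = (f * g).integ()
    return antiderivative(1.0) - antiderivative(-1.0)

def norm(f):
    """Norm induced by the inner product above."""
    return np.sqrt(inner(f, f))

f = Polynomial([0.0, 1.0])         # f(t) = t
g = Polynomial([0.0, 0.0, 1.0])    # g(t) = t^2

print(inner(f, g))                 # f and g are orthogonal
print(norm(f), np.sqrt(2 / 3))
print(norm(g), np.sqrt(2 / 5))
print(norm(f - g), 4 / np.sqrt(15))
```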


Let us analyze an inner product defined over a vector space. Rather than considering all possible 


vector pairs within the vector space, we focus on pairs of linearly independent vectors. 


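Lemma 4.1 Assume non-zero, linearly independent vectors $\mathbf{v}_1, \ldots, \mathbf{v}_k$ in a vector space $V$. Define a $k \times k$ matrix $A = (a_{ij})$, where $a_{ij} = \langle \mathbf{v}_i, \mathbf{v}_j \rangle$. Then, $A$ is symmetric and invertible.

Proof: Symmetry is naturally derived from the definition of the inner product. To check the invertibility of $A$, we need to show that $A\mathbf{x} = \mathbf{0}$ has no non-trivial solution. Let $\mathbf{x} \in \mathbb{R}^k$ satisfy $A\mathbf{x} = \mathbf{0}$. The $i$-th equation of $A\mathbf{x} = \mathbf{0}$ is

$$a_{i1}x_1 + \cdots + a_{ik}x_k = x_1\langle \mathbf{v}_i, \mathbf{v}_1 \rangle + \cdots + x_k\langle \mathbf{v}_i, \mathbf{v}_k \rangle = 0.$$

Rearranging the equation using the linearity of the inner product gives

$$\langle \mathbf{v}_i, x_1\mathbf{v}_1 + \cdots + x_k\mathbf{v}_k \rangle = 0 \quad \text{for } i = 1, \ldots, k.$$

If we add the equations after multiplying the $i$-th equation by $x_i$, the linearity again allows

$$0 = \sum_{i=1}^{k} x_i\langle \mathbf{v}_i, x_1\mathbf{v}_1 + \cdots + x_k\mathbf{v}_k \rangle = \langle x_1\mathbf{v}_1 + \cdots + x_k\mathbf{v}_k, x_1\mathbf{v}_1 + \cdots + x_k\mathbf{v}_k \rangle.$$

Then the definition of the inner product imposes $x_1\mathbf{v}_1 + \cdots + x_k\mathbf{v}_k = \mathbf{0}$, which implies $\mathbf{x} = \mathbf{0}$ by the linear independence of the $\mathbf{v}_i$'s. Hence, the null space of $A$ is $\{\mathbf{0}\}$, and $A$ is invertible. ∎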
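A small numerical illustration of Lemma 4.1, using the usual dot product on $\mathbb{R}^6$ as the inner product. The random vectors, their number, and the deliberately dependent third vector in the second part are assumptions made for this sketch.

```python
import numpy as np

rng = np.random.default_rng(5)
V = rng.standard_normal((6, 3))   # columns v_1, v_2, v_3 in R^6, independent for this draw
G = V.T @ V                       # G[i, j] = <v_i, v_j> under the dot product

print(np.allclose(G, G.T))        # symmetric
print(np.linalg.matrix_rank(G))   # full rank 3, hence invertible

# Replacing v_3 by v_1 + v_2 breaks independence, and invertibility is lost.
V_dep = V.copy()
V_dep[:, 2] = V[:, 0] + V[:, 1]
G_dep = V_dep.T @ V_dep
print(np.linalg.matrix_rank(G_dep))
```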


Characterization of an Inner Product 


Using Lemma 4.1, let us see how we can characterize an inner product in terms of inner products of basic 
vector pairs. 

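Consider a basis $\mathcal{B}_{\mathbb{V}} = \{\mathbf{v}_1, \ldots, \mathbf{v}_n\}$ for $\mathbb{V}$. For any two vectors $\mathbf{v}$ and $\mathbf{w}$ in $\mathbb{V}$, there exist two unique vectors $\mathbf{x} = (x_1, \ldots, x_n)^\top$ and $\mathbf{y} = (y_1, \ldots, y_n)^\top$ in $\mathbb{R}^n$, respectively, such that

\[ \mathbf{v} = x_1 \mathbf{v}_1 + \cdots + x_n \mathbf{v}_n \quad \text{and} \quad \mathbf{w} = y_1 \mathbf{v}_1 + \cdots + y_n \mathbf{v}_n. \]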


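Then, the bilinearity of inner product implies

\[ \langle \mathbf{v}, \mathbf{w} \rangle = \Big\langle \sum_{i=1}^{n} x_i \mathbf{v}_i,\; \sum_{j=1}^{n} y_j \mathbf{v}_j \Big\rangle = \sum_{i=1}^{n} \sum_{j=1}^{n} x_i y_j \langle \mathbf{v}_i, \mathbf{v}_j \rangle. \]

If we set $a_{ij} = \langle \mathbf{v}_i, \mathbf{v}_j \rangle$ and construct an $n \times n$ symmetric matrix $A = (a_{ij})$,

\[ \langle \mathbf{v}, \mathbf{w} \rangle = \sum_{i=1}^{n} \sum_{j=1}^{n} x_i y_j a_{ij} = \mathbf{x}^\top A \mathbf{y}. \]

Since $\mathbf{v} = \mathbf{0} \in \mathbb{V}$ if and only if $\mathbf{x} = \mathbf{0} \in \mathbb{R}^n$, $\langle \mathbf{v}, \mathbf{v} \rangle = \mathbf{x}^\top A \mathbf{x} > 0$ if and only if $\mathbf{x} \neq \mathbf{0}$. Hence, the matrix $A$ can be characterized as

an $n \times n$ real symmetric matrix such that $\mathbf{x}^\top A \mathbf{x} > 0$ for any $\mathbf{x} \neq \mathbf{0}$.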


Therefore, a quadratic form taking positive values for any non-zero vectors characterizes an inner product. The matrix appearing in the quadratic form also characterizes the inner product. This property is called positive definiteness and formalized by the following Definition 4.2.¹

Definition 4.2 A square matrix $A$ is positive definite if $\mathbf{x}^\top A \mathbf{x} > 0$ for all $\mathbf{x} \neq \mathbf{0}$.


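Conversely, with a positive definite matrix $A$, a bilinear function, defined as

\[ \Big\langle \sum_{i=1}^{n} x_i \mathbf{v}_i,\; \sum_{j=1}^{n} y_j \mathbf{v}_j \Big\rangle = \mathbf{x}^\top A \mathbf{y}, \tag{4.1} \]

induces an inner product by Lemma 4.2.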


We can summarize this observation in the following theorem: 


Theorem 4.1 An inner product in an $n$-dimensional vector space is characterized by an $n \times n$ symmetric positive definite matrix as in (4.1).

To complete the proof of Theorem 4.1, we must introduce the following lemma.


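Lemma 4.2 Let $\mathbb{V}$ be a vector space and $A$ be an $n \times n$ symmetric positive definite matrix. Fix an arbitrary basis of $\mathbb{V}$, such as $\mathcal{B}_{\mathbb{V}} = \{\mathbf{v}_1, \ldots, \mathbf{v}_n\}$. For any two vectors $\mathbf{v}$ and $\mathbf{w}$ in $\mathbb{V}$, let $\langle \mathbf{v}, \mathbf{w} \rangle = \mathbf{x}^\top A \mathbf{y}$, where $\mathbf{x} = (x_1, \ldots, x_n)^\top$ and $\mathbf{y} = (y_1, \ldots, y_n)^\top$ satisfy $\mathbf{v} = \sum_{i=1}^{n} x_i \mathbf{v}_i$ and $\mathbf{w} = \sum_{i=1}^{n} y_i \mathbf{v}_i$. Then, $\langle \mathbf{v}, \mathbf{w} \rangle$ is an inner product of $\mathbb{V}$.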


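Proof: Let $\mathbf{v} = \sum_{i=1}^{n} x_i \mathbf{v}_i$, $\mathbf{w} = \sum_{i=1}^{n} y_i \mathbf{v}_i$, and $\mathbf{u} = \sum_{i=1}^{n} z_i \mathbf{v}_i$.

• $\langle \mathbf{v}, \mathbf{w} \rangle = \mathbf{x}^\top A \mathbf{y} = (\mathbf{x}^\top A \mathbf{y})^\top = \mathbf{y}^\top A^\top \mathbf{x} = \mathbf{y}^\top A \mathbf{x} = \langle \mathbf{w}, \mathbf{v} \rangle$;

• $\langle c\mathbf{v}, \mathbf{w} \rangle = (c\mathbf{x})^\top A \mathbf{y} = c(\mathbf{x}^\top A \mathbf{y}) = c \langle \mathbf{v}, \mathbf{w} \rangle$;

• $\langle \mathbf{v} + \mathbf{u}, \mathbf{w} \rangle = (\mathbf{x} + \mathbf{z})^\top A \mathbf{y} = \mathbf{x}^\top A \mathbf{y} + \mathbf{z}^\top A \mathbf{y} = \langle \mathbf{v}, \mathbf{w} \rangle + \langle \mathbf{u}, \mathbf{w} \rangle$;

• $\langle \mathbf{v}, \mathbf{v} \rangle = \mathbf{x}^\top A \mathbf{x} > 0$ if $\mathbf{x} \neq \mathbf{0}$, or equivalently $\mathbf{v} \neq \mathbf{0}$.

¹ We learn about positive definite matrices in more detail in Chapter 7.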


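Example 4.2 Let $\mathbb{V}$ be a collection of polynomials of degree less than 3. That is, $\mathbb{V} = \{a_0 + a_1 x + a_2 x^2 : a_i \in \mathbb{R}\}$. We discussed that $\mathbb{V}$ is a vector space. Define $\langle f, g \rangle = \int_{-1}^{1} f(x) g(x)\, dx$ for $f, g \in \mathbb{V}$. $\langle \cdot, \cdot \rangle$ is an inner product on $\mathbb{V}$ since integration is linear in the integrand. Let us characterize this inner product using a positive definite matrix with respect to a basis $\{1, x, x^2\}$.

\[ \begin{aligned}
\langle 1, 1 \rangle &= \int_{-1}^{1} 1 \cdot 1\, dx = 2, & \langle 1, x \rangle &= \int_{-1}^{1} 1 \cdot x\, dx = 0, \\
\langle 1, x^2 \rangle &= \int_{-1}^{1} 1 \cdot x^2\, dx = \frac{2}{3}, & \langle x, x \rangle &= \int_{-1}^{1} x \cdot x\, dx = \frac{2}{3}, \\
\langle x, x^2 \rangle &= \int_{-1}^{1} x \cdot x^2\, dx = 0, & \langle x^2, x^2 \rangle &= \int_{-1}^{1} x^2 \cdot x^2\, dx = \frac{2}{5}.
\end{aligned} \]

The matrix corresponding to this inner product is given by

\[ \begin{bmatrix} 2 & 0 & \frac{2}{3} \\ 0 & \frac{2}{3} & 0 \\ \frac{2}{3} & 0 & \frac{2}{5} \end{bmatrix}. \]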
2 0 5 
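Example 4.3 Let $\mathbb{V} = \mathbb{R}^3$ with a bilinear function $\langle \mathbf{x}, \mathbf{y} \rangle = \mathbf{x}^\top A \mathbf{y}$ for $\mathbf{x}, \mathbf{y} \in \mathbb{V}$ where

\[ A = \begin{bmatrix} 2 & 0 & \frac{2}{3} \\ 0 & \frac{2}{3} & 0 \\ \frac{2}{3} & 0 & \frac{2}{5} \end{bmatrix}. \]

From

\[ \begin{aligned}
\mathbf{x}^\top A \mathbf{x} &= \begin{bmatrix} x_1 & x_2 & x_3 \end{bmatrix} \begin{bmatrix} 2 & 0 & \frac{2}{3} \\ 0 & \frac{2}{3} & 0 \\ \frac{2}{3} & 0 & \frac{2}{5} \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} \\
&= 2 x_1^2 + \frac{4}{3} x_1 x_3 + \frac{2}{3} x_2^2 + \frac{2}{5} x_3^2 \\
&= 2 \Big( x_1 + \frac{1}{3} x_3 \Big)^2 + \frac{2}{3} x_2^2 + \frac{8}{45} x_3^2,
\end{aligned} \]

we see that $\langle \mathbf{x}, \mathbf{x} \rangle \geq 0$ and that $\langle \mathbf{x}, \mathbf{x} \rangle = 0$ implies $\mathbf{x} = \mathbf{0}$. $\langle \mathbf{x}, \mathbf{y} \rangle$ is also bilinear. Therefore, it is an inner product. This inner product is equivalent to the one from Example 4.2 above, as

\[ \begin{aligned}
\langle (1,0,0)^\top, (1,0,0)^\top \rangle &= 2, & \langle (1,0,0)^\top, (0,1,0)^\top \rangle &= 0, \\
\langle (1,0,0)^\top, (0,0,1)^\top \rangle &= \frac{2}{3}, & \langle (0,1,0)^\top, (0,1,0)^\top \rangle &= \frac{2}{3}, \\
\langle (0,1,0)^\top, (0,0,1)^\top \rangle &= 0, & \langle (0,0,1)^\top, (0,0,1)^\top \rangle &= \frac{2}{5}.
\end{aligned} \]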
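The equivalence claimed in Examples 4.2 and 4.3 can be illustrated numerically. The sketch below is ours, not the authors': it assumes polynomials are stored as coefficient lists with respect to $\{1, x, x^2\}$, and the names `poly_inner`, `gram` and `bilinear` are hypothetical helpers. It builds the matrix of inner products of the basis and checks that $\mathbf{x}^\top A \mathbf{y}$ agrees with the integral for one pair of polynomials.

```python
from fractions import Fraction

def poly_inner(f, g):
    """Integral of f(x) * g(x) over [-1, 1]; f and g are coefficient lists."""
    total = Fraction(0)
    for i, a in enumerate(f):
        for j, b in enumerate(g):
            if (i + j) % 2 == 0:  # odd powers integrate to zero on [-1, 1]
                total += Fraction(a) * Fraction(b) * Fraction(2, i + j + 1)
    return total

def bilinear(x, A, y):
    """Evaluate x^T A y for coordinate lists x, y and a square matrix A."""
    return sum(x[i] * A[i][j] * y[j] for i in range(len(x)) for j in range(len(y)))

basis = [[1], [0, 1], [0, 0, 1]]  # the polynomials 1, x, x^2
gram = [[poly_inner(p, q) for q in basis] for p in basis]

f = [1, -2, 3]  # 1 - 2x + 3x^2, also its coordinate vector in the basis
g = [0, 4, 5]   # 4x + 5x^2
by_integral = poly_inner(f, g)
by_matrix = bilinear(f, gram, g)
print(gram)
print(by_integral == by_matrix)
```

Because the coordinates of a polynomial with respect to $\{1, x, x^2\}$ are exactly its coefficients, the same list serves as both the polynomial and its coordinate vector here.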




Even in a Euclidean space, which is frequently used by and familiar to us, we can define a variety of 


inner products. Among these inner products, we call the one defined below the standard inner product. 


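Definition 4.3 We define the standard inner product, or sometimes dot product, between two vectors, $\mathbf{x} = (x_1, \ldots, x_n)^\top$ and $\mathbf{y} = (y_1, \ldots, y_n)^\top$, in the $n$-dimensional Euclidean space, $\mathbb{R}^n$, as

\[ \langle \mathbf{x}, \mathbf{y} \rangle = x_1 y_1 + \cdots + x_n y_n = \mathbf{x}^\top \mathbf{y}. \]

The norm is thus defined as $|\mathbf{x}| = \sqrt{\mathbf{x}^\top \mathbf{x}} = \sqrt{x_1^2 + \cdots + x_n^2}$, and we refer to it as the Euclidean norm.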


The standard inner product in R” corresponds to (4.1) with A = In. (x,y) = 2>0¢_, teyn, (Ky) = 


per kTRYR, and (x,y) = p_, ELRYx are examples of inner products in R”. 


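Example 4.4 Let us see that the standard inner product is indeed an inner product in $\mathbb{R}^n$ by showing that the dot product satisfies the properties of an inner product, following elementary arithmetic steps.

Given $\mathbf{x}, \mathbf{y}, \mathbf{z} \in \mathbb{R}^n$ and $c \in \mathbb{R}$,

• $\langle \mathbf{x}, \mathbf{y} \rangle = \mathbf{x}^\top \mathbf{y} = x_1 y_1 + \cdots + x_n y_n = y_1 x_1 + \cdots + y_n x_n = \mathbf{y}^\top \mathbf{x} = \langle \mathbf{y}, \mathbf{x} \rangle$;

• $\langle c\mathbf{x}, \mathbf{y} \rangle = (c\mathbf{x})^\top \mathbf{y} = (c x_1) y_1 + \cdots + (c x_n) y_n = c (x_1 y_1 + \cdots + x_n y_n) = c (\mathbf{x}^\top \mathbf{y}) = c \langle \mathbf{x}, \mathbf{y} \rangle$;

• $\langle \mathbf{x} + \mathbf{y}, \mathbf{z} \rangle = (\mathbf{x} + \mathbf{y})^\top \mathbf{z} = (x_1 + y_1) z_1 + \cdots + (x_n + y_n) z_n = (x_1 z_1 + \cdots + x_n z_n) + (y_1 z_1 + \cdots + y_n z_n) = \mathbf{x}^\top \mathbf{z} + \mathbf{y}^\top \mathbf{z} = \langle \mathbf{x}, \mathbf{z} \rangle + \langle \mathbf{y}, \mathbf{z} \rangle$;

• $|\mathbf{x}|^2 = \langle \mathbf{x}, \mathbf{x} \rangle = \mathbf{x}^\top \mathbf{x} = x_1^2 + \cdots + x_n^2 \geq 0$. If $|\mathbf{x}| = 0$, all $x_i = 0$ and $\mathbf{x} = \mathbf{0}$. It is straightforward to show the converse.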


How can we check whether two directions defined respectively between the origin and two vectors, $\mathbf{x} = (x_1, \ldots, x_n)^\top$ and $\mathbf{y} = (y_1, \ldots, y_n)^\top$, are orthogonal to each other, when neither $\mathbf{x}$ nor $\mathbf{y}$ is $\mathbf{0}$, in $\mathbb{R}^n$? Assume $\mathbf{x}$ and $\mathbf{y}$ are not parallel to each other,² so that there exists a plane that passes through the origin $\mathbf{0}$, $\mathbf{x}$ and $\mathbf{y}$. This plane is a 2-dimensional subspace $\mathbb{P} = \{a\mathbf{x} + b\mathbf{y} : a, b \in \mathbb{R}\}$ spanned by $\mathbf{x} - \mathbf{0}$ and $\mathbf{y} - \mathbf{0}$.³ We can form a triangle in the 2-dimensional plane $\mathbb{P}$ with $\mathbf{0}$, $\mathbf{x}$, and $\mathbf{y}$ as its three vertices. The lengths of the edges in this triangle are Euclidean norms of the edges, $|\mathbf{x}|$, $|\mathbf{y}|$ and $|\mathbf{x} - \mathbf{y}|$. If the Pythagorean relationship holds, i.e., $|\mathbf{x}|^2 + |\mathbf{y}|^2 = |\mathbf{x} - \mathbf{y}|^2$, the directions defined by $\mathbf{x}$ and $\mathbf{y}$ are orthogonal to each other. For the standard inner product, the Pythagorean relationship reduces to

\[ \begin{aligned}
0 &= |\mathbf{x}|^2 + |\mathbf{y}|^2 - |\mathbf{x} - \mathbf{y}|^2 \\
&= x_1^2 + \cdots + x_n^2 + y_1^2 + \cdots + y_n^2 - (x_1 - y_1)^2 - \cdots - (x_n - y_n)^2 \\
&= 2 \mathbf{x}^\top \mathbf{y},
\end{aligned} \]

which implies that $\mathbf{x}^\top \mathbf{y} = 0$ is a necessary and sufficient condition for $\mathbf{x}$ and $\mathbf{y}$ being orthogonal to each other.

² The angle between $\mathbf{x}$ and $\mathbf{y}$ is either 0 or 180° if they are parallel.
³ Because any 1-dimensional subspace that passes through the origin cannot include both $\mathbf{x}$ and $\mathbf{y}$ simultaneously, it is easy to notice that the minimal dimension of such a subspace is 2.
From here, we can think of computing an angle more generally between any pair of vectors. Again using the known result from 2-dimensional geometry, we know that

\[ |\mathbf{x} - \mathbf{y}|^2 = |\mathbf{x}|^2 + |\mathbf{y}|^2 - 2 |\mathbf{x}|\, |\mathbf{y}| \cos\theta, \]

when the angle between $\mathbf{x}$ and $\mathbf{y}$ is $\theta$. Similarly to above, we can see that $-2 \mathbf{x}^\top \mathbf{y} = -2 |\mathbf{x}|\, |\mathbf{y}| \cos\theta$, which allows us to compute the angle between $\mathbf{x}$ and $\mathbf{y}$ as

\[ \cos\theta = \frac{\mathbf{x}^\top \mathbf{y}}{|\mathbf{x}|\, |\mathbf{y}|}. \tag{4.2} \]

In data science, we often refer to $\cos\theta$ as the cosine similarity between $\mathbf{x}$ and $\mathbf{y}$, and use it to measure the similarity between two vectors in terms of their angles while ignoring their norms.
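Formula (4.2) translates directly into code. The following is a minimal sketch using plain Python lists and the standard inner product; the function names `dot`, `norm` and `cosine_similarity` are our own choices, and raising an error for a zero vector is an assumption we make because (4.2) is undefined in that case.

```python
import math

def dot(x, y):
    """Standard inner product x^T y."""
    return sum(a * b for a, b in zip(x, y))

def norm(x):
    """Euclidean norm induced by the standard inner product."""
    return math.sqrt(dot(x, x))

def cosine_similarity(x, y):
    """cos(theta) = x^T y / (|x| |y|), as in (4.2)."""
    nx, ny = norm(x), norm(y)
    if nx == 0 or ny == 0:
        raise ValueError("cosine similarity is undefined for the zero vector")
    return dot(x, y) / (nx * ny)

x = [1.0, 2.0, 2.0]
y = [2.0, -1.0, 0.0]  # x^T y = 0, so x and y are orthogonal
z = [3.0, 6.0, 6.0]   # z = 3x points in the same direction as x

print(cosine_similarity(x, y))
print(cosine_similarity(x, z))
```

The second call shows the point made above: scaling a vector changes its norm but not its cosine similarity with another vector.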


4.2 Orthogonal Vectors and Subspaces 


So far, we have shown that the standard inner product between two orthogonal vectors is zero by expressing the angle between two vectors using the standard inner product and the Euclidean norm derived from it. We now define orthogonality for a general inner product, beyond the standard inner product, since the inner product is the only geometric concept available in a general vector space. This allows us to derive a variety of interesting results later.


Definition 4.4 In a finite-dimensional vector space $\mathbb{V}$ with an inner product $\langle \cdot, \cdot \rangle$,

1. We say $\mathbf{v}_1$ and $\mathbf{v}_2$ are orthogonal and use $\mathbf{v}_1 \perp \mathbf{v}_2$ to express the orthogonality, if $\langle \mathbf{v}_1, \mathbf{v}_2 \rangle = 0$.

2. If every pair of vectors among $\mathbf{v}_1, \ldots, \mathbf{v}_k$ is orthogonal, i.e., $\mathbf{v}_i \perp \mathbf{v}_j$ for $i \neq j$, we say that they are (mutually) orthogonal.

3. When unit vectors are (mutually) orthogonal, we say they are orthonormal.

4. If a basis consists of orthonormal basic vectors, we call it an orthonormal basis.


Can we connect this generalized notion of orthogonality with the traditional notion of orthogonality? 
We start by showing that this new definition of orthogonality extends the Pythagorean relationship among 


the edges of an orthogonal triangle to an arbitrary vector space. 




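Fact 4.3 Two vectors $\mathbf{v}_1$ and $\mathbf{v}_2$ are orthogonal if and only if the Pythagorean relation holds:

\[ |\mathbf{v}_1|^2 + |\mathbf{v}_2|^2 = |\mathbf{v}_1 + \mathbf{v}_2|^2. \]

Proof: Since $|\mathbf{v}_1 + \mathbf{v}_2|^2 = \langle \mathbf{v}_1 + \mathbf{v}_2, \mathbf{v}_1 + \mathbf{v}_2 \rangle = \langle \mathbf{v}_1, \mathbf{v}_1 \rangle + 2 \langle \mathbf{v}_1, \mathbf{v}_2 \rangle + \langle \mathbf{v}_2, \mathbf{v}_2 \rangle$ for any two vectors $\mathbf{v}_1$ and $\mathbf{v}_2$, we have

\[ |\mathbf{v}_1 + \mathbf{v}_2|^2 = |\mathbf{v}_1|^2 + 2 \langle \mathbf{v}_1, \mathbf{v}_2 \rangle + |\mathbf{v}_2|^2, \]

so the Pythagorean relation holds exactly when $\langle \mathbf{v}_1, \mathbf{v}_2 \rangle = 0$, which proves the statement. ∎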
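We can also show that orthogonal vectors are linearly independent.

Fact 4.4 If non-zero vectors $\mathbf{v}_1, \ldots, \mathbf{v}_n$ are mutually orthogonal, then they are linearly independent.

Proof: Consider $c_1 \mathbf{v}_1 + \cdots + c_n \mathbf{v}_n = \mathbf{0}$ for some real $c_i$'s. Since the $\mathbf{v}_i$'s are mutually orthogonal,

\[ \begin{aligned}
\langle \mathbf{v}_i, \mathbf{0} \rangle &= \langle \mathbf{v}_i,\, c_1 \mathbf{v}_1 + \cdots + c_n \mathbf{v}_n \rangle \\
&= c_1 \langle \mathbf{v}_i, \mathbf{v}_1 \rangle + \cdots + c_i \langle \mathbf{v}_i, \mathbf{v}_i \rangle + \cdots + c_n \langle \mathbf{v}_i, \mathbf{v}_n \rangle \\
&= c_i \langle \mathbf{v}_i, \mathbf{v}_i \rangle.
\end{aligned} \]

Since $\langle \mathbf{v}_i, \mathbf{v}_i \rangle > 0$ and $\langle \mathbf{v}_i, \mathbf{0} \rangle = 0$, $c_i = 0$. ∎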
We can draw the following observation about orthonormal vectors. 


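Fact 4.5 Let $\{\mathbf{v}_1, \ldots, \mathbf{v}_n\}$ be an orthonormal basis for a vector space $\mathbb{V}$. For any vector $\mathbf{v} \in \mathbb{V}$,

\[ \mathbf{v} = \langle \mathbf{v}, \mathbf{v}_1 \rangle \mathbf{v}_1 + \cdots + \langle \mathbf{v}, \mathbf{v}_n \rangle \mathbf{v}_n = \sum_{i=1}^{n} \langle \mathbf{v}, \mathbf{v}_i \rangle \mathbf{v}_i \]

holds and the representation is unique.

Proof: Since the basis spans the space, we have $\mathbf{v} = x_1 \mathbf{v}_1 + \cdots + x_n \mathbf{v}_n = \sum_{i=1}^{n} x_i \mathbf{v}_i$ for some real $x_i$'s. From the orthogonality and normality,

\[ \langle \mathbf{v}, \mathbf{v}_j \rangle = \Big\langle \sum_{i=1}^{n} x_i \mathbf{v}_i,\, \mathbf{v}_j \Big\rangle = \sum_{i=1}^{n} x_i \langle \mathbf{v}_i, \mathbf{v}_j \rangle = x_j \langle \mathbf{v}_j, \mathbf{v}_j \rangle = x_j \quad \text{for each } j = 1, \ldots, n, \]

and we obtain the desired representation. ∎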
Fact 4.5 implies that we can perfectly represent an arbitrary vector as inner-products against basic 
vectors in an orthonormal basis. That is, each inner product (v,v;) serves as a coordinate along the 


corresponding orthonormal basic vector. 
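To make Fact 4.5 concrete, the sketch below computes the coordinates of a vector in a non-standard orthonormal basis of $\mathbb{R}^2$ and rebuilds the vector from them. The particular basis, the vector, and the helper name `dot` are illustrative assumptions of ours; exact fractions are used so that the reconstruction can be compared without rounding.

```python
from fractions import Fraction as F

def dot(x, y):
    """Standard inner product x^T y."""
    return sum(a * b for a, b in zip(x, y))

# An orthonormal basis of R^2 obtained by rotating the standard basis.
basis = [(F(3, 5), F(4, 5)), (F(-4, 5), F(3, 5))]
v = (F(2), F(-1))

# Each coordinate is simply the inner product with the matching basic vector.
coords = [dot(v, b) for b in basis]

# Rebuild v as the sum of coordinate * basic vector.
reconstructed = tuple(sum(c * b[i] for c, b in zip(coords, basis)) for i in range(2))
print(coords, reconstructed)
```

No linear system has to be solved here; that is the practical benefit of orthonormality that the fact expresses.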


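Example 4.5 For the Euclidean vector space $\mathbb{R}^n$, define the standard basic vectors

\[ \mathbf{e}_i = (0, \ldots, 0, 1, 0, \ldots, 0)^\top \in \mathbb{R}^n, \quad i = 1, \ldots, n, \]

whose $i$-th component is 1, and all others are 0. The standard basis in $\mathbb{R}^n$ is $\{\mathbf{e}_1, \ldots, \mathbf{e}_n\}$.

1. The standard basis in $\mathbb{R}^n$ is orthonormal since $|\mathbf{e}_i|^2 = \mathbf{e}_i^\top \mathbf{e}_i = 1$ and $\mathbf{e}_i^\top \mathbf{e}_j = 0$ for $i \neq j$;

2. For $\mathbf{x} = (x_1, \ldots, x_n)^\top \in \mathbb{R}^n$, $x_i = \langle \mathbf{x}, \mathbf{e}_i \rangle = \mathbf{x}^\top \mathbf{e}_i$ and $\mathbf{x} = x_1 \mathbf{e}_1 + \cdots + x_n \mathbf{e}_n$, which confirms Fact 4.5 for the standard basis and the standard inner product in $\mathbb{R}^n$ since $\mathbf{x}^\top \mathbf{e}_i = x_i$ and $\mathbf{x} = (x_1, 0, \ldots, 0)^\top + \cdots + (0, \ldots, 0, x_n)^\top = x_1 \mathbf{e}_1 + \cdots + x_n \mathbf{e}_n$.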


In mathematics, it is usual to say that a property is satisfied by two sets, if it holds between every pair 


of elements from two sets. Along this line, we can generalize the notion of orthogonality to subspaces. 


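Definition 4.5 We say two subspaces, $\mathbb{U}$ and $\mathbb{W}$, of a vector space $\mathbb{V}$ are orthogonal and use $\mathbb{U} \perp \mathbb{W}$ to denote the orthogonality, if

\[ \mathbf{u} \perp \mathbf{w} \quad \text{for all } \mathbf{u} \in \mathbb{U},\ \mathbf{w} \in \mathbb{W}. \]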


From this definition of orthogonality of subspaces, we can define the orthogonal complement. 


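Definition 4.6 An orthogonal complement of a subspace $\mathbb{W}$ of a vector space $\mathbb{V}$ is defined as

\[ \mathbb{W}^\perp = \{\mathbf{v} \in \mathbb{V} : \mathbf{v} \perp \mathbf{w} \text{ for all } \mathbf{w} \in \mathbb{W}\}. \]

Due to the linearity of an inner product, $\mathbb{W}^\perp$ is also a subspace of $\mathbb{V}$.

Fact 4.6 $\mathbb{W}^\perp$ is also a subspace of a vector space $\mathbb{V}$ if $\mathbb{W}$ is a subspace of $\mathbb{V}$.

Proof: By definition, $\mathbb{W}^\perp \subset \mathbb{V}$ and $\mathbf{0} \in \mathbb{W}^\perp$. With $\alpha, \beta \in \mathbb{R}$ and $\mathbf{v}_1, \mathbf{v}_2 \in \mathbb{W}^\perp$, $\langle \alpha \mathbf{v}_1 + \beta \mathbf{v}_2, \mathbf{w} \rangle = \alpha \langle \mathbf{v}_1, \mathbf{w} \rangle + \beta \langle \mathbf{v}_2, \mathbf{w} \rangle = 0$ for any $\mathbf{w} \in \mathbb{W}$. That is, $\alpha \mathbf{v}_1 + \beta \mathbf{v}_2 \in \mathbb{W}^\perp$. Therefore, $\mathbb{W}^\perp$ is a subspace of $\mathbb{V}$. ∎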


We should warn you that two orthogonal subspaces may not be the orthogonal complement of each other. For instance, two subspaces of $\mathbb{R}^3$ under the standard inner product, $\mathbb{U} = \operatorname{span}\{(1,0,0)\}$ and $\mathbb{W} = \operatorname{span}\{(0,1,0)\}$, are orthogonal, but the orthogonal complement of the former, $\mathbb{U}^\perp = \operatorname{span}\{(0,1,0), (0,0,1)\}$, is a proper superset of $\mathbb{W}$. In other words, $\mathbb{U} \perp \mathbb{W}$ but $\mathbb{U}^\perp \supsetneq \mathbb{W}$. We present and study specific examples of orthogonal complements in Euclidean spaces, such as $\operatorname{Col}(A^\top)^\perp = \operatorname{Null}(A)$ and $\operatorname{Null}(A)^\perp = \operatorname{Col}(A^\top)$, later in Section 4.6.


From the example above, where two 1-dimensional subspaces were produced from two orthogonal 
vectors, respectively, we can see that two orthogonal subspaces do not overlap with each other except for 


the origin in general. 


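Fact 4.7 Let $\mathbb{W}$ be a subspace of a vector space $\mathbb{V}$, and $\mathbb{W}^\perp$ be the orthogonal complement of $\mathbb{W}$ in $\mathbb{V}$. Then, $\mathbb{W} \cap \mathbb{W}^\perp = \{\mathbf{0}\}$.

Proof: Let $\mathbf{w} \in \mathbb{W} \cap \mathbb{W}^\perp$. Since $\mathbf{w} \in \mathbb{W}$ and $\mathbf{w} \in \mathbb{W}^\perp$, we conclude $\langle \mathbf{w}, \mathbf{w} \rangle = 0$ by the definition of the orthogonal complement. The definition of the inner product implies that $\mathbf{w} = \mathbf{0}$. ∎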

According to Definition 3.3, Fact 4.7 guarantees $\mathbb{W} + \mathbb{W}^\perp = \mathbb{W} \oplus \mathbb{W}^\perp$. Therefore, the summands are unique once a vector is represented as a sum of two vectors from a subspace and its orthogonal complement by Fact 3.1. We summarize this observation as the following theorem.


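Theorem 4.2 Let $\mathbb{W}$ be a subspace of a vector space $\mathbb{V}$, and $\mathbb{W}^\perp$ be the orthogonal complement of $\mathbb{W}$ in $\mathbb{V}$. If every $\mathbf{v} \in \mathbb{V}$ has a decomposition of $\mathbf{v} = \mathbf{w} + \mathbf{z}$ where $\mathbf{w} \in \mathbb{W}$ and $\mathbf{z} \in \mathbb{W}^\perp$, then this decomposition is unique and $\mathbb{V} = \mathbb{W} \oplus \mathbb{W}^\perp$.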


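Example 4.6 Let $\mathbb{V} = \mathbb{R}^3$ and $\mathbb{W} = \mathbb{R} \times \{0\} \times \{0\}$. Then $\mathbb{W}^\perp = \{0\} \times \mathbb{R} \times \mathbb{R}$. Check that $\mathbb{W} \cap \mathbb{W}^\perp = \{\mathbf{0}\}$. For $(x, y, z)^\top \in \mathbb{V}$, $(x, y, z)^\top = (x, 0, 0)^\top + (0, y, z)^\top \in \mathbb{W} + \mathbb{W}^\perp$. Since $\mathbb{W} \cap \mathbb{W}^\perp = \{\mathbf{0}\}$, $\mathbb{W} + \mathbb{W}^\perp = \mathbb{W} \oplus \mathbb{W}^\perp$, and this representation is unique by Fact 3.1. ∎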


In Section 4.5, we will further show that a finite-dimensional vector space $\mathbb{V}$ can be expressed as a direct sum of any subspace and its orthogonal complement. That is, any vector can be uniquely represented as the sum of two vectors, one from the subspace and one from its orthogonal complement.

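Another problem we frequently run into is to determine whether a vector $\mathbf{v}$ is orthogonal to $\mathbb{W} = \operatorname{span}\{\mathbf{w}_1, \ldots, \mathbf{w}_k\}$. Instead of checking whether $\langle \mathbf{v}, \mathbf{w} \rangle = 0$ for every $\mathbf{w} \in \mathbb{W}$, it is enough to check whether $\langle \mathbf{v}, \mathbf{w}_j \rangle = 0$ for each $\mathbf{w}_j$.

Lemma 4.3 Let $\mathbb{V}$ be a vector space and $\{\mathbf{w}_1, \ldots, \mathbf{w}_k\}$ be a set of some vectors in $\mathbb{V}$. For any $\mathbf{v} \in \mathbb{V}$,

\[ \mathbf{v} \perp \operatorname{span}\{\mathbf{w}_1, \ldots, \mathbf{w}_k\} \quad \text{if and only if} \quad \mathbf{v} \perp \mathbf{w}_j \text{ for } j = 1, \ldots, k. \]

Proof: The “only if” part is clear, since all $\mathbf{w}_j \in \mathbb{W} = \operatorname{span}\{\mathbf{w}_1, \ldots, \mathbf{w}_k\}$. For the “if” direction, assume $\mathbf{w} \in \mathbb{W}$. Then $\mathbf{w}$ can be written as $\mathbf{w} = x_1 \mathbf{w}_1 + \cdots + x_k \mathbf{w}_k$ for some $x_1, \ldots, x_k \in \mathbb{R}$. So,

\[ \langle \mathbf{v}, \mathbf{w} \rangle = \langle \mathbf{v},\, x_1 \mathbf{w}_1 + \cdots + x_k \mathbf{w}_k \rangle = x_1 \langle \mathbf{v}, \mathbf{w}_1 \rangle + \cdots + x_k \langle \mathbf{v}, \mathbf{w}_k \rangle = 0, \]

and hence $\mathbf{v} \perp \mathbb{W}$. ∎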


Keep in mind that we did not put any other condition, such as linear independence, on $\mathbf{w}_1, \ldots, \mathbf{w}_k$ to get this result.


4.3 Orthogonal Projection 


4.3.1 Projection onto the Direction of a Vector 


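Consider a vector $\mathbf{w} \neq \mathbf{0}$ in a vector space $\mathbb{V}$. Among all the points on a line passing through the origin along the direction $\mathbf{w}$, let us pick one that is closest to another vector $\mathbf{v}$. We use the norm of a vector, induced by an inner product, to measure the distance. First, we parametrize all the points (vectors) on the line along the direction of $\mathbf{w}$ using $\lambda$:

\[ \{\lambda \mathbf{w} : \lambda \in \mathbb{R}\}. \]

The distance from $\mathbf{v}$ to an arbitrary point on this line is then $|\lambda \mathbf{w} - \mathbf{v}|$. We can rewrite this as the following:

\[ \begin{aligned}
|\lambda \mathbf{w} - \mathbf{v}|^2 &= \langle \lambda \mathbf{w} - \mathbf{v},\, \lambda \mathbf{w} - \mathbf{v} \rangle \\
&= \lambda^2 \langle \mathbf{w}, \mathbf{w} \rangle - 2 \lambda \langle \mathbf{v}, \mathbf{w} \rangle + \langle \mathbf{v}, \mathbf{v} \rangle.
\end{aligned} \]

This quadratic function of $\lambda$ is minimized with

\[ \lambda^* = \frac{\langle \mathbf{v}, \mathbf{w} \rangle}{\langle \mathbf{w}, \mathbf{w} \rangle}. \]

The vector on the line along $\mathbf{w}$ that is nearest to $\mathbf{v}$ is then⁴

\[ \lambda^* \mathbf{w} = \frac{\langle \mathbf{v}, \mathbf{w} \rangle}{\langle \mathbf{w}, \mathbf{w} \rangle} \mathbf{w} = \Big\langle \mathbf{v}, \frac{\mathbf{w}}{|\mathbf{w}|} \Big\rangle \frac{\mathbf{w}}{|\mathbf{w}|}. \]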


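According to our intuition from Euclidean spaces, the vector that represents the shortest distance, $\mathbf{v} - \lambda^* \mathbf{w}$, and $\mathbf{w}$ should be orthogonal. Indeed, this holds in an arbitrary vector space equipped with an inner product as well, since

\[ \langle \mathbf{v} - \lambda^* \mathbf{w}, \mathbf{w} \rangle = \langle \mathbf{v}, \mathbf{w} \rangle - \lambda^* \langle \mathbf{w}, \mathbf{w} \rangle = 0. \]

We thus call $\lambda^* \mathbf{w}$ the orthogonal projection of $\mathbf{v}$ onto $\mathbf{w}$.⁵ In a vector space with an inner product, we project a vector onto a line and obtain the nearest vector on the line to the original vector. We represent such orthogonal projection of $\mathbf{v}$ onto $\mathbf{w}$ as

\[ p(\mathbf{v}) = \frac{\langle \mathbf{v}, \mathbf{w} \rangle}{\langle \mathbf{w}, \mathbf{w} \rangle} \mathbf{w}. \tag{4.3} \]

If $\mathbf{w}$ is a unit vector, i.e., $\langle \mathbf{w}, \mathbf{w} \rangle = 1$, the representation can be simplified as

\[ p(\mathbf{v}) = \langle \mathbf{v}, \mathbf{w} \rangle \mathbf{w}. \tag{4.4} \]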


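Because $\langle \mathbf{v}, \mathbf{w} \rangle$ is linear in $\mathbf{v}$, the orthogonal projection $p(\cdot)$ is also linear. In addition, the resulting vector from orthogonal projection remains the same even after further orthogonal projection onto the same direction, since

\[ p(p(\mathbf{v})) = \frac{\langle p(\mathbf{v}), \mathbf{w} \rangle}{\langle \mathbf{w}, \mathbf{w} \rangle} \mathbf{w} = \frac{\langle \mathbf{v}, \mathbf{w} \rangle}{\langle \mathbf{w}, \mathbf{w} \rangle} \cdot \frac{\langle \mathbf{w}, \mathbf{w} \rangle}{\langle \mathbf{w}, \mathbf{w} \rangle} \mathbf{w} = p(\mathbf{v}). \]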


We have introduced an orthogonal projection onto a one-dimensional subspace spanned by a single 


vector. We now generalize it to the orthogonal projection onto a subspace as follows. 


⁴ Because $\mathbf{w}$'s role here is to identify the direction, its norm/magnitude is not essential. In order to make it more concise, we can start from a unit vector $\mathbf{w}$, i.e., $|\mathbf{w}| = \langle \mathbf{w}, \mathbf{w} \rangle = 1$.
⁵ Since we only deal with orthogonal projection in this book, we will sometimes omit orthogonal and simply say projection, for brevity.




Definition 4.7 An orthogonal projection is a linear transform that maps any vector to another 


vector in a subspace W such that the direction connecting two vectors is orthogonal to the subspace 
W. We denote the projection by Pw. In the Euclidean vector space, a matrix is called an orthogonal 


projection matrix if the matrix represents a projection. 


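Fact 4.8 In the $n$-dimensional Euclidean space $\mathbb{R}^n$, the orthogonal projection matrix onto a vector $\mathbf{w}$ is

\[ P = \frac{1}{\mathbf{w}^\top \mathbf{w}} \mathbf{w} \mathbf{w}^\top. \tag{4.5} \]

This matrix is symmetric and satisfies $P^2 = P$.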


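Proof: Consider a $\mathbf{v} \in \mathbb{R}^n$. By rearranging (4.3) in $\mathbb{R}^n$, we observe

\[ p(\mathbf{v}) = \frac{\langle \mathbf{v}, \mathbf{w} \rangle}{\langle \mathbf{w}, \mathbf{w} \rangle} \mathbf{w} = \frac{1}{\mathbf{w}^\top \mathbf{w}} (\mathbf{v}^\top \mathbf{w}) \mathbf{w} = \frac{1}{\mathbf{w}^\top \mathbf{w}} \mathbf{w} (\mathbf{w}^\top \mathbf{v}) = \Big( \frac{1}{\mathbf{w}^\top \mathbf{w}} \mathbf{w} \mathbf{w}^\top \Big) \mathbf{v}, \]

where the last term is a multiplication of a matrix $\frac{1}{\mathbf{w}^\top \mathbf{w}} \mathbf{w} \mathbf{w}^\top$ and a vector $\mathbf{v}$. Therefore, the projection matrix is a rank-one matrix, as in

\[ P = \frac{1}{\mathbf{w}^\top \mathbf{w}} \mathbf{w} \mathbf{w}^\top. \]

It is thus trivial that $P$ is symmetric, and $P = P^2$, because

\[ P^2 = \frac{1}{(\mathbf{w}^\top \mathbf{w})^2} (\mathbf{w} \mathbf{w}^\top)(\mathbf{w} \mathbf{w}^\top) = \frac{1}{(\mathbf{w}^\top \mathbf{w})^2} \mathbf{w} (\mathbf{w}^\top \mathbf{w}) \mathbf{w}^\top = \frac{1}{\mathbf{w}^\top \mathbf{w}} \mathbf{w} \mathbf{w}^\top = P. \]

∎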
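The rank-one projection matrix of Fact 4.8 is easy to build and test by hand. The sketch below is an illustration under our own assumptions (plain nested lists as matrices, exact fractions, and hypothetical helper names `projection_matrix`, `matvec`, `matmul`); it checks symmetry, $P^2 = P$, and the orthogonality of the residual $\mathbf{v} - P\mathbf{v}$ to $\mathbf{w}$ for one example.

```python
from fractions import Fraction

def projection_matrix(w):
    """Return P = w w^T / (w^T w) as a nested list of fractions."""
    ww = sum(Fraction(a) * a for a in w)
    if ww == 0:
        raise ValueError("cannot project onto the zero vector")
    return [[Fraction(a) * b / ww for b in w] for a in w]

def matvec(M, v):
    """Matrix-vector product M v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def matmul(A, B):
    """Matrix-matrix product A B."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

w = [1, 2, 2]
v = [3, 0, 3]
P = projection_matrix(w)
p = matvec(P, v)                           # projection of v onto w
residual = [a - b for a, b in zip(v, p)]   # v - P v

is_symmetric = all(P[i][j] == P[j][i] for i in range(3) for j in range(3))
is_idempotent = matmul(P, P) == P
residual_dot_w = sum(r * a for r, a in zip(residual, w))
print(p, is_symmetric, is_idempotent, residual_dot_w)
```

For these inputs $\mathbf{v}^\top \mathbf{w} = \mathbf{w}^\top \mathbf{w} = 9$, so the projection is $\mathbf{w}$ itself and the residual $(2, -2, 1)^\top$ is orthogonal to $\mathbf{w}$.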


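Example 4.7 [Householder matrix] One representative example of a reflection matrix is a Householder matrix, which is created from the projection matrix $P = \frac{1}{\mathbf{v}^\top \mathbf{v}} \mathbf{v} \mathbf{v}^\top$ toward $\mathbf{v} \in \mathbb{R}^n$, defined as follows:

\[ H = I - 2P = I - 2 \frac{1}{\mathbf{v}^\top \mathbf{v}} \mathbf{v} \mathbf{v}^\top. \]

For any vector $\mathbf{u}$, set $\mathbf{v} = \mathbf{u} + |\mathbf{u}| \mathbf{e}_1$. Since $\mathbf{v}^\top \mathbf{u} = \mathbf{u}^\top \mathbf{u} + |\mathbf{u}| u_1$,

\[ \mathbf{v}^\top \mathbf{v} = \mathbf{u}^\top \mathbf{u} + 2 u_1 |\mathbf{u}| + |\mathbf{u}|^2 = 2 (\mathbf{u}^\top \mathbf{u} + |\mathbf{u}| u_1) = 2 \mathbf{v}^\top \mathbf{u} \]

holds. Therefore, the transformation corresponding to a Householder matrix based on a direction $\mathbf{v}$ moves a vector $\mathbf{u}$ to

\[ H \mathbf{u} = \Big( I - 2 \frac{1}{\mathbf{v}^\top \mathbf{v}} \mathbf{v} \mathbf{v}^\top \Big) \mathbf{u} = \mathbf{u} - 2 \frac{1}{\mathbf{v}^\top \mathbf{v}} \mathbf{v} \mathbf{v}^\top \mathbf{u} = \mathbf{u} - \mathbf{v} = -|\mathbf{u}| \mathbf{e}_1 = (-|\mathbf{u}|, 0, \ldots, 0)^\top. \]

That is, this reflection transformation keeps the size of $\mathbf{u}$, but aligns the direction along the negative side of the $x_1$ axis. The Householder transformation is important in numerical linear algebra. ∎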
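The calculation in Example 4.7 can be replayed numerically. The following sketch applies $H = I - 2\mathbf{v}\mathbf{v}^\top / (\mathbf{v}^\top \mathbf{v})$ to a vector without forming $H$ explicitly; the function name `householder_apply` and the sample vector are our own illustrative choices, and the error for $\mathbf{v} = \mathbf{0}$ covers the case where $\mathbf{u}$ already points along the negative $x_1$ axis.

```python
import math

def dot(x, y):
    """Standard inner product x^T y."""
    return sum(a * b for a, b in zip(x, y))

def householder_apply(v, x):
    """Return H x for H = I - 2 v v^T / (v^T v), without forming H."""
    vv = dot(v, v)
    if vv == 0:
        raise ValueError("the direction v must be non-zero")
    scale = 2.0 * dot(v, x) / vv
    return [xi - scale * vi for xi, vi in zip(x, v)]

u = [3.0, 4.0, 0.0]
norm_u = math.sqrt(dot(u, u))
v = [u[0] + norm_u] + u[1:]   # v = u + |u| e_1

Hu = householder_apply(v, u)
back = householder_apply(v, Hu)   # reflecting twice should return u
print(Hu, back)
```

Here $|\mathbf{u}| = 5$, so the reflected vector is $(-5, 0, 0)^\top$, and applying the same reflection again restores $\mathbf{u}$, which reflects the fact that $H^2 = I$.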




4.3.2 Projection onto a Subspace Spanned by Orthonormal Vectors 


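Let $\mathbb{W}$ be a subspace of a finite-dimensional vector space $\mathbb{V}$, and $\{\mathbf{w}_1, \ldots, \mathbf{w}_k\}$ be an orthonormal basis of $\mathbb{W}$. That is, $\langle \mathbf{w}_i, \mathbf{w}_i \rangle = 1$ and $\langle \mathbf{w}_i, \mathbf{w}_j \rangle = 0$ for $i \neq j$. We can write arbitrary $\mathbf{w} \in \mathbb{W}$ as a linear combination of these basic vectors:

\[ \mathbf{w} = x_1 \mathbf{w}_1 + \cdots + x_k \mathbf{w}_k = \sum_{i=1}^{k} x_i \mathbf{w}_i \in \mathbb{W}. \]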


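For a given $\mathbf{v} \in \mathbb{V}$, we consider a problem of finding $\mathbf{w} \in \mathbb{W}$ such that $\mathbf{v} - \mathbf{w}$ is orthogonal to $\mathbb{W}$. This is equivalent to finding its coordinates $x_i$, and the inner product in the vector space plays an important role in determining these coordinate values.

According to Lemma 4.3, that the direction of $\mathbf{v} - \mathbf{w}$ is orthogonal to the subspace $\mathbb{W}$ is equivalent to saying that $\mathbf{v} - \mathbf{w}$ is orthogonal to every $\mathbf{w}_j$. We can thus get $x_j = \langle \mathbf{w}_j, \mathbf{v} \rangle$ from $\langle \mathbf{w}_j, \mathbf{v} - \mathbf{w} \rangle = 0$, because

\[ \langle \mathbf{w}_j, \mathbf{v} - \mathbf{w} \rangle = \Big\langle \mathbf{w}_j,\, \mathbf{v} - \sum_{i=1}^{k} x_i \mathbf{w}_i \Big\rangle = \langle \mathbf{w}_j, \mathbf{v} \rangle - \sum_{i=1}^{k} x_i \langle \mathbf{w}_j, \mathbf{w}_i \rangle = \langle \mathbf{w}_j, \mathbf{v} \rangle - x_j = 0. \]

That is, if we set

\[ \mathbf{w}^* = \langle \mathbf{v}, \mathbf{w}_1 \rangle \mathbf{w}_1 + \cdots + \langle \mathbf{v}, \mathbf{w}_k \rangle \mathbf{w}_k = \sum_{i=1}^{k} \langle \mathbf{v}, \mathbf{w}_i \rangle \mathbf{w}_i, \]

$\mathbf{v} - \mathbf{w}^*$ is orthogonal to all the basic vectors of $\mathbb{W}$ and therefore it is orthogonal to $\mathbb{W}$. We can decompose $\mathbf{v}$ as

\[ \mathbf{v} = \mathbf{w}^* + (\mathbf{v} - \mathbf{w}^*). \]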


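Because $\mathbf{w}^* \in \mathbb{W}$ and $\mathbf{v} - \mathbf{w}^* \in \mathbb{W}^\perp$, we see that this decomposition is unique, according to Theorem 4.2. In other words, there is a unique point at which the line passing through $\mathbf{v}$ along the direction orthogonal to $\mathbb{W}$ meets the subspace $\mathbb{W}$. We call this vector the projection of $\mathbf{v}$ onto $\mathbb{W}$ and use $P_{\mathbb{W}}(\mathbf{v})$ to denote it. This can be expressed as

\[ P_{\mathbb{W}}(\mathbf{v}) = \langle \mathbf{v}, \mathbf{w}_1 \rangle \mathbf{w}_1 + \cdots + \langle \mathbf{v}, \mathbf{w}_k \rangle \mathbf{w}_k = \sum_{i=1}^{k} \langle \mathbf{v}, \mathbf{w}_i \rangle \mathbf{w}_i \quad \text{where } \mathbb{W} = \operatorname{span}\{\mathbf{w}_1, \ldots, \mathbf{w}_k\}. \tag{4.6} \]

The orthonormality of the spanning vectors is crucial to this succinct representation. Since $P_{\mathbb{W}}(\mathbf{v}) \in \mathbb{W}$, we have $P_{\mathbb{W}}(P_{\mathbb{W}}(\mathbf{v})) = P_{\mathbb{W}}(\mathbf{v})$, that is, $P_{\mathbb{W}} \circ P_{\mathbb{W}} = P_{\mathbb{W}}$. Because $\langle \mathbf{v}, \mathbf{w}_i \rangle \mathbf{w}_i$ is linear in $\mathbf{v}$, the orthogonal projection $P_{\mathbb{W}}(\mathbf{v})$ is also linear in $\mathbf{v}$. Therefore, an orthogonal projection can be expressed as a matrix. When there is only one basic vector, i.e., $k = 1$, (4.6) and (4.4) are identical.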


Fact 4.9 Let $P_{\mathbb{W}}$ be an orthogonal projection onto a subspace $\mathbb{W}$ spanned by orthonormal basic vectors, $\{\mathbf{w}_1, \ldots, \mathbf{w}_k\} \subset \mathbb{R}^n$. $Q$ is an $n \times k$ matrix whose $i$-th column is $\mathbf{w}_i$ such that $Q^\top Q = I_k$. The projection matrix of $P_{\mathbb{W}}$ can be expressed as

\[ P = Q Q^\top. \tag{4.7} \]

This matrix is symmetric and satisfies $P^2 = P$.


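Proof: Let $\mathbf{v} \in \mathbb{R}^n$ be a vector we want to project. Using the standard inner product, $\langle \mathbf{x}, \mathbf{y} \rangle = \mathbf{x}^\top \mathbf{y}$, we can express the projection in the form of matrix-vector multiplication, as follows:

\[ \begin{aligned}
P_{\mathbb{W}}(\mathbf{v}) &= \mathbf{w}_1 \langle \mathbf{w}_1, \mathbf{v} \rangle + \cdots + \mathbf{w}_k \langle \mathbf{w}_k, \mathbf{v} \rangle \\
&= \mathbf{w}_1 \mathbf{w}_1^\top \mathbf{v} + \cdots + \mathbf{w}_k \mathbf{w}_k^\top \mathbf{v} \\
&= (\mathbf{w}_1 \mathbf{w}_1^\top + \cdots + \mathbf{w}_k \mathbf{w}_k^\top) \mathbf{v} \\
&= Q Q^\top \mathbf{v} \quad \text{by Lemma 3.5.}
\end{aligned} \]

The symmetry and $P^2 = P$ can be easily checked. ∎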
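As an illustration of Fact 4.9, the sketch below compares the sum formula (4.6) with the matrix form $P = QQ^\top$ for two orthonormal vectors in $\mathbb{R}^3$. The vectors and the variable names are hypothetical examples of ours, and the matrix is built entry by entry as $P_{ij} = \sum_k (\mathbf{w}_k)_i (\mathbf{w}_k)_j$ rather than through a linear algebra library.

```python
from fractions import Fraction as F

def dot(x, y):
    """Standard inner product x^T y."""
    return sum(a * b for a, b in zip(x, y))

# Two orthonormal vectors in R^3 (illustrative data).
w1 = [F(1, 3), F(2, 3), F(2, 3)]
w2 = [F(2, 3), F(1, 3), F(-2, 3)]
basis = [w1, w2]
n = 3

# P = Q Q^T, where the columns of Q are w1 and w2.
P = [[sum(w[i] * w[j] for w in basis) for j in range(n)] for i in range(n)]

v = [F(1), F(1), F(1)]
by_matrix = [dot(row, v) for row in P]                              # P v
by_sum = [sum(dot(v, w) * w[i] for w in basis) for i in range(n)]   # formula (4.6)

P_squared = [[sum(P[i][k] * P[k][j] for k in range(n)) for j in range(n)]
             for i in range(n)]
print(by_matrix, by_sum, P_squared == P)
```

Both routes give the same projected vector, and the check `P_squared == P` mirrors the statement $P^2 = P$.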


4.3.3 Projection onto a Subspace Spanned by Independent Vectors 


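Let a subspace $\mathbb{W}$ of a finite-dimensional vector space $\mathbb{V}$ be spanned by linearly independent vectors $\{\mathbf{w}_1, \ldots, \mathbf{w}_k\}$. They may not be orthonormal. To project the vector $\mathbf{v} \in \mathbb{V}$ onto the subspace $\mathbb{W}$, we want $\mathbf{w} \in \mathbb{W}$ that satisfies $(\mathbf{v} - \mathbf{w}) \perp \mathbb{W}$. Note that, because $\mathbb{W} = \operatorname{span}\{\mathbf{w}_1, \ldots, \mathbf{w}_k\}$,

\[ (\mathbf{v} - \mathbf{w}) \perp \mathbb{W} \iff (\mathbf{v} - \mathbf{w}) \perp \mathbf{w}_i \text{ for all } i = 1, \ldots, k \]

according to Lemma 4.3. We can represent $\mathbf{w}$ as $\mathbf{w} = x_1 \mathbf{w}_1 + \cdots + x_k \mathbf{w}_k$ with an appropriate choice of coefficients, $\mathbf{x} = (x_1, \ldots, x_k)^\top \in \mathbb{R}^k$. We combine these two to get

\[ \begin{aligned}
0 &= \langle \mathbf{w}_i, \mathbf{w} - \mathbf{v} \rangle \\
&= \langle \mathbf{w}_i,\, x_1 \mathbf{w}_1 + \cdots + x_k \mathbf{w}_k - \mathbf{v} \rangle \\
&= \langle \mathbf{w}_i,\, x_1 \mathbf{w}_1 + \cdots + x_k \mathbf{w}_k \rangle - \langle \mathbf{w}_i, \mathbf{v} \rangle \\
&= x_1 \langle \mathbf{w}_i, \mathbf{w}_1 \rangle + \cdots + x_k \langle \mathbf{w}_i, \mathbf{w}_k \rangle - \langle \mathbf{w}_i, \mathbf{v} \rangle.
\end{aligned} \]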


Unlike the case of orthonormal spanning vectors, we cannot determine the coordinates one by one separately. Instead we have to consider the $k$ equations at the same time. That is, the orthogonal projection's coefficient $\mathbf{x}^*$ is a solution to the following linear system

\[ B \mathbf{x} = \mathbf{c}, \]

where $B = (b_{ij})$ is a $k \times k$ symmetric matrix with $b_{ij} = \langle \mathbf{w}_i, \mathbf{w}_j \rangle$ and $\mathbf{c} = (c_i)$ is a vector with $c_i = \langle \mathbf{w}_i, \mathbf{v} \rangle$. Due to Lemma 4.1, $B$ is invertible. Then, with $\mathbf{x}^* = B^{-1} \mathbf{c}$, we can compute the orthogonal projection $\mathbf{w}^*$ as

\[ P_{\mathbb{W}}(\mathbf{v}) = x_1^* \mathbf{w}_1 + \cdots + x_k^* \mathbf{w}_k. \tag{4.8} \]


When the basis $\{\mathbf{w}_1, \ldots, \mathbf{w}_k\}$ is orthonormal, the basic vectors are linearly independent. Then, we can compute the projection using this method with $B = I_k$, that is, $\mathbf{x}^* = \mathbf{c}$. In this case of orthonormal basic vectors, it coincides with (4.6).

In the fact below, we learn a more specific way to express projection in the n-dimensional Euclidean 


space, which is popular in many applications. 


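Fact 4.10 Let $P_{\mathbb{W}}$ be the projection of a vector onto a subspace $\mathbb{W}$ spanned by linearly independent vectors $\{\mathbf{w}_1, \ldots, \mathbf{w}_k\} \subset \mathbb{R}^n$. Let $P$ be the corresponding projection matrix with respect to the standard orthonormal basis of $\mathbb{R}^n$. If $A$ is an $n \times k$ matrix whose columns are the $\mathbf{w}_i$'s,

\[ P = A (A^\top A)^{-1} A^\top. \tag{4.9} \]

Furthermore, (4.5) and (4.7) are special cases of (4.9).

Proof: If we let $b_{ij} = \langle \mathbf{w}_i, \mathbf{w}_j \rangle = \mathbf{w}_i^\top \mathbf{w}_j$, then $B = (b_{ij}) = A^\top A$ and $B$ is invertible. For $\mathbf{v} \in \mathbb{R}^n$ to be projected, $\mathbf{c} = A^\top \mathbf{v}$, and $\mathbf{x}^* = B^{-1} \mathbf{c} = (A^\top A)^{-1} A^\top \mathbf{v}$. The projection is then

\[ P_{\mathbb{W}}(\mathbf{v}) = x_1^* \mathbf{w}_1 + \cdots + x_k^* \mathbf{w}_k = A \mathbf{x}^* = A (A^\top A)^{-1} A^\top \mathbf{v}, \]

and, therefore, the projection matrix is

\[ P = A (A^\top A)^{-1} A^\top. \]

By setting $A = \mathbf{w}$ (a single column) and $A = Q$ (so that $B = Q^\top Q = I_k$), respectively, it is easy to see that (4.5) and (4.7) are special cases of (4.9). ∎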


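Example 4.8 Let $S$ be a 2-dimensional surface in $\mathbb{R}^3$. Consider a point on $S$, $\mathbf{p} = (1, 2, -1)^\top \in S$. When two tangential vectors of $S$ at $\mathbf{p}$ are $(-1, 0, 1)^\top$ and $(1, 1, 0)^\top$, let us find the projection point of $(2, 3, 0)^\top$ onto the tangential plane of $S$ at $\mathbf{p}$. Assuming $\mathbf{p}$ as the origin, the tangential subspace (plane) is spanned by two tangential vectors. Set the $3 \times 2$ matrix $A$ consisting of two tangential vectors:

\[ A = \begin{bmatrix} -1 & 1 \\ 0 & 1 \\ 1 & 0 \end{bmatrix}, \]

and the point to be projected, shifted to the new origin, is $\mathbf{b} = (2, 3, 0)^\top - (1, 2, -1)^\top = (1, 1, 1)^\top$. Hence, the projection matrix is

\[ \begin{aligned}
P = A (A^\top A)^{-1} A^\top &= \begin{bmatrix} -1 & 1 \\ 0 & 1 \\ 1 & 0 \end{bmatrix} \left( \begin{bmatrix} -1 & 0 & 1 \\ 1 & 1 & 0 \end{bmatrix} \begin{bmatrix} -1 & 1 \\ 0 & 1 \\ 1 & 0 \end{bmatrix} \right)^{-1} \begin{bmatrix} -1 & 0 & 1 \\ 1 & 1 & 0 \end{bmatrix} \\
&= \frac{1}{3} \begin{bmatrix} 2 & 1 & -1 \\ 1 & 2 & 1 \\ -1 & 1 & 2 \end{bmatrix},
\end{aligned} \]

and the projection point, in coordinates relative to the new origin $\mathbf{p}$, is $P\mathbf{b} = (\frac{2}{3}, \frac{4}{3}, \frac{2}{3})^\top$. ∎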
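The procedure of this subsection, solving $B\mathbf{x} = \mathbf{c}$ and forming (4.8), can be sketched in a few lines and checked against Example 4.8. The code below is an illustration under our own assumptions: the helpers `solve` and `project` are hypothetical names, the small Gauss-Jordan routine stands in for any linear solver, and exact fractions are used to avoid rounding.

```python
from fractions import Fraction

def dot(x, y):
    """Standard inner product x^T y."""
    return sum(a * b for a, b in zip(x, y))

def solve(B, c):
    """Solve B x = c by Gauss-Jordan elimination with exact fractions."""
    n = len(B)
    M = [[Fraction(B[i][j]) for j in range(n)] + [Fraction(c[i])] for i in range(n)]
    for col in range(n):
        pivot = next((r for r in range(col, n) if M[r][col] != 0), None)
        if pivot is None:
            raise ValueError("matrix is singular")
        M[col], M[pivot] = M[pivot], M[col]
        M[col] = [e / M[col][col] for e in M[col]]
        for r in range(n):
            if r != col:
                factor = M[r][col]
                M[r] = [a - factor * b for a, b in zip(M[r], M[col])]
    return [M[i][n] for i in range(n)]

def project(columns, v):
    """Project v onto span(columns) via B x = c with B = A^T A and c = A^T v."""
    B = [[dot(wi, wj) for wj in columns] for wi in columns]
    c = [dot(wi, v) for wi in columns]
    x = solve(B, c)
    return [sum(xk * w[i] for xk, w in zip(x, columns)) for i in range(len(v))]

columns = [[-1, 0, 1], [1, 1, 0]]   # the two tangential vectors of Example 4.8
b = [1, 1, 1]                       # the point shifted to the new origin
proj = project(columns, b)
residual = [bi - pi for bi, pi in zip(b, proj)]
print(proj, [dot(residual, w) for w in columns])
```

For these inputs the coefficients are $\mathbf{x}^* = (\frac{2}{3}, \frac{4}{3})^\top$, and the result agrees with $P\mathbf{b}$ in Example 4.8; the residual is orthogonal to both tangential vectors.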


4.4 Building an Orthonormal Basis: Gram-Schmidt Procedure 


As we have just learned, we can project a vector onto a subspace spanned by an orthonormal basis by summing the vectors separately projected onto each basic vector. In this section, we consider how to incrementally find orthonormal vectors using this projection method.

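Let $\mathcal{B}_k = \{\mathbf{w}_1, \ldots, \mathbf{w}_k\} \subset \mathbb{W}$ be orthonormal. If $\mathbb{W}_k = \operatorname{span} \mathcal{B}_k$ is a proper subspace of $\mathbb{W}$, i.e., $\mathbb{W}_k \subsetneq \mathbb{W}$, there exists at least one vector in $\mathbb{W}$ but not in $\mathbb{W}_k$, i.e., $\mathbf{w} \in \mathbb{W} \setminus \mathbb{W}_k$. Using (4.6), we can write the projection of $\mathbf{w}$ onto $\mathbb{W}_k$ as

\[ P_{\mathbb{W}_k}(\mathbf{w}) = \sum_{i=1}^{k} \langle \mathbf{w}, \mathbf{w}_i \rangle \mathbf{w}_i \in \mathbb{W}_k. \]

Also, $\mathbf{w} - P_{\mathbb{W}_k}(\mathbf{w}) \in \mathbb{W}_k^\perp$. Since $\mathbf{w} \notin \mathbb{W}_k$, $\mathbf{w} - P_{\mathbb{W}_k}(\mathbf{w})$ cannot be $\mathbf{0}$. We now define the next orthonormal vector $\mathbf{w}_{k+1}$ by

\[ \mathbf{w}_{k+1} = \frac{1}{|\mathbf{w} - P_{\mathbb{W}_k}(\mathbf{w})|} \big( \mathbf{w} - P_{\mathbb{W}_k}(\mathbf{w}) \big). \]

Because $\mathbf{w}_{k+1} \perp \mathcal{B}_k$, $\mathcal{B}_{k+1} = \mathcal{B}_k \cup \{\mathbf{w}_{k+1}\}$ is orthonormal, just like $\mathcal{B}_k$. That is, $\mathbb{W}_{k+1}$ spanned by $\mathcal{B}_{k+1}$ is a subspace of $\mathbb{W}$, however with one more dimension than $\mathbb{W}_k$. Because $\mathbb{W}$ is finite-dimensional, we can repeat this procedure and get an orthonormal basis of $\mathbb{W}$. We call this iterative process the Gram-Schmidt procedure. We start the Gram-Schmidt procedure by choosing any non-zero vector in $\mathbb{W}$ and normalizing its norm to be 1. This tells us the following fact.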


Fact 4.11 Let W be a finite-dimensional subspace and B be a set of orthonormal vectors in W. Then, 


there exists an orthonormal basis of W containing B, and we can construct it explicitly. 


4.4.1 Gram-Schmidt Procedure for given Linearly Independent Vectors 


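Let $\{\mathbf{a}_1, \ldots, \mathbf{a}_k\}$ be a set of linearly independent vectors. The dimension of a vector space $\mathbb{W}$ spanned by these vectors is $k$. Just like before with orthonormal vectors, we can use the Gram-Schmidt procedure to find a basis with special properties. That is, we follow the above procedure by setting $\mathbb{W}_i = \operatorname{span}\{\mathbf{a}_1, \ldots, \mathbf{a}_i\}$ and $\mathbf{w} = \mathbf{a}_{i+1} \in \mathbb{W} \setminus \mathbb{W}_i$ for $i < k$.

Set $\mathbf{w}_1 = \frac{1}{|\mathbf{a}_1|} \mathbf{a}_1$ and $i = 1$;

1. A set of $i$ orthonormal vectors, $\mathcal{B}_i = \{\mathbf{w}_1, \ldots, \mathbf{w}_i\}$ with $i < k$, is given.

2. Compute

\[ \begin{aligned}
\mathbf{v}_{i+1} &= \mathbf{a}_{i+1} - P_{\operatorname{span} \mathcal{B}_i}(\mathbf{a}_{i+1}) \\
&= \mathbf{a}_{i+1} - \big( \langle \mathbf{a}_{i+1}, \mathbf{w}_1 \rangle \mathbf{w}_1 + \cdots + \langle \mathbf{a}_{i+1}, \mathbf{w}_i \rangle \mathbf{w}_i \big), \\
\mathbf{w}_{i+1} &= \frac{1}{|\mathbf{v}_{i+1}|} \mathbf{v}_{i+1}.
\end{aligned} \tag{4.10} \]

3. Update $\mathcal{B}_{i+1} = \mathcal{B}_i \cup \{\mathbf{w}_{i+1}\}$.

4. Set $i \leftarrow i + 1$. Repeat while $i < k$.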
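The four steps above can be sketched for vectors in $\mathbb{R}^n$ with the standard inner product. This is a minimal illustration rather than a numerically robust routine: the function name `gram_schmidt` and the tolerance `tol`, which we use to detect linearly dependent input, are our own assumptions.

```python
import math

def dot(x, y):
    """Standard inner product x^T y."""
    return sum(a * b for a, b in zip(x, y))

def gram_schmidt(vectors, tol=1e-12):
    """Return orthonormal vectors w_1, ..., w_k following (4.10)."""
    basis = []
    for a in vectors:
        # Step 2: subtract from a its projection onto span(basis).
        v = list(a)
        for w in basis:
            coeff = dot(a, w)
            v = [vi - coeff * wi for vi, wi in zip(v, w)]
        norm_v = math.sqrt(dot(v, v))
        if norm_v < tol:
            raise ValueError("input vectors are linearly dependent")
        # Normalize and (step 3) add the new vector to the basis.
        basis.append([vi / norm_v for vi in v])
    return basis

a_vectors = [[1.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]
w_vectors = gram_schmidt(a_vectors)
for w in w_vectors:
    print(w)
```

The first pass through the loop has an empty basis, so it reduces to normalizing $\mathbf{a}_1$, which matches the initialization $\mathbf{w}_1 = \mathbf{a}_1 / |\mathbf{a}_1|$.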


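The $\mathbf{w}_i$'s produced by this procedure satisfy the following properties.

Fact 4.12 Let $\{\mathbf{a}_1, \ldots, \mathbf{a}_k\}$ be independent vectors. The Gram-Schmidt procedure above produces an orthonormal basis $\{\mathbf{w}_1, \ldots, \mathbf{w}_k\}$ of the subspace $\operatorname{span}\{\mathbf{a}_1, \ldots, \mathbf{a}_k\}$ such that

\[ \operatorname{span}\{\mathbf{a}_1, \ldots, \mathbf{a}_j\} = \operatorname{span}\{\mathbf{w}_1, \ldots, \mathbf{w}_j\} \quad \text{for } j = 1, \ldots, k. \]

Furthermore, each $\mathbf{w}_j$ explains some part of $\mathbf{a}_j$ which does not belong to $\operatorname{span}\{\mathbf{w}_1, \ldots, \mathbf{w}_{j-1}\}$, that is,

\[ \langle \mathbf{a}_j, \mathbf{w}_j \rangle \neq 0 \quad \text{for } j = 1, \ldots, k. \tag{4.11} \]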


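Proof: We use mathematical induction on $k$. It is trivially true with $k = 1$. Assume that the statement holds up to some $j < k$, that is,

\[ \operatorname{span}\{\mathbf{a}_1, \ldots, \mathbf{a}_i\} = \operatorname{span}\{\mathbf{w}_1, \ldots, \mathbf{w}_i\} \quad \text{for all } i \leq j. \tag{4.12} \]

• In the second step of the Gram-Schmidt procedure, $\mathbf{v}_{j+1} = \mathbf{0}$ if $|\mathbf{v}_{j+1}| = 0$. This implies that $\mathbf{a}_{j+1} \in \operatorname{span}\{\mathbf{w}_1, \ldots, \mathbf{w}_j\}$ and that $\mathbf{a}_{j+1} \in \operatorname{span}\{\mathbf{a}_1, \ldots, \mathbf{a}_j\}$ due to the assumption in (4.12), which contradicts the linear independence among the $\mathbf{a}_i$'s. Therefore, $|\mathbf{v}_{j+1}| \neq 0$.

• If we rewrite the same second step by replacing $\mathbf{v}_{j+1}$ with $|\mathbf{v}_{j+1}| \mathbf{w}_{j+1}$,

\[ \mathbf{a}_{j+1} = \langle \mathbf{a}_{j+1}, \mathbf{w}_1 \rangle \mathbf{w}_1 + \cdots + \langle \mathbf{a}_{j+1}, \mathbf{w}_j \rangle \mathbf{w}_j + |\mathbf{v}_{j+1}| \mathbf{w}_{j+1}. \]

Thus, $\mathbf{a}_{j+1} \in \operatorname{span}\{\mathbf{w}_1, \ldots, \mathbf{w}_{j+1}\}$.

• If we further divide (4.10) in the second step by $|\mathbf{v}_{j+1}|$ after replacing $\mathbf{v}_{j+1}$ with $|\mathbf{v}_{j+1}| \mathbf{w}_{j+1}$,

\[ \mathbf{w}_{j+1} = \frac{1}{|\mathbf{v}_{j+1}|} \mathbf{a}_{j+1} - \frac{1}{|\mathbf{v}_{j+1}|} \big( \langle \mathbf{a}_{j+1}, \mathbf{w}_1 \rangle \mathbf{w}_1 + \cdots + \langle \mathbf{a}_{j+1}, \mathbf{w}_j \rangle \mathbf{w}_j \big). \]

In other words, $\mathbf{w}_{j+1}$ is a linear combination of $\mathbf{a}_{j+1}$ and $\mathbf{w}_1, \ldots, \mathbf{w}_j$. According to (4.12), $\mathbf{w}_{j+1} \in \operatorname{span}\{\mathbf{a}_1, \ldots, \mathbf{a}_{j+1}\}$.

Therefore, $\operatorname{span}\{\mathbf{a}_1, \ldots, \mathbf{a}_{j+1}\} = \operatorname{span}\{\mathbf{w}_1, \ldots, \mathbf{w}_{j+1}\}$.

Let us show $\langle \mathbf{a}_j, \mathbf{w}_j \rangle \neq 0$. If we compute the inner products of both sides of (4.10) and $\mathbf{w}_{j+1}$, we get $\langle \mathbf{a}_{j+1}, \mathbf{w}_{j+1} \rangle = \langle \mathbf{v}_{j+1}, \mathbf{w}_{j+1} \rangle$, because $\mathbf{w}_{j+1} \perp \operatorname{span}\{\mathbf{w}_1, \ldots, \mathbf{w}_j\}$. Since $\langle \mathbf{v}_{j+1}, \mathbf{w}_{j+1} \rangle = |\mathbf{v}_{j+1}|$, we get $\langle \mathbf{a}_{j+1}, \mathbf{w}_{j+1} \rangle = |\mathbf{v}_{j+1}| > 0$. ∎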


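Example 4.9 Let $\mathbb{V}$ be a vector space consisting of polynomials of degree less than 3. That is, $\mathbb{V} = \{a_0 + a_1 x + a_2 x^2 : a_i \in \mathbb{R}\}$. For an inner product $\langle f, g \rangle = \int_{-1}^{1} f(x) g(x)\, dx$ for $f, g \in \mathbb{V}$, let us find an orthonormal basis starting with the independent polynomials $\{1, x, x^2\}$.

1. $\mathbf{v}_1 = 1$, $\langle \mathbf{v}_1, \mathbf{v}_1 \rangle = \int_{-1}^{1} 1\, dx = 2$, $\mathbf{w}_1 = \frac{1}{|\mathbf{v}_1|} \mathbf{v}_1 = \frac{1}{\sqrt{2}}$;

2. $\mathbf{v}_2 = x - \langle x, \mathbf{w}_1 \rangle \mathbf{w}_1 = x$, $\langle \mathbf{v}_2, \mathbf{v}_2 \rangle = \int_{-1}^{1} x^2\, dx = \frac{2}{3}$, $\mathbf{w}_2 = \frac{1}{|\mathbf{v}_2|} \mathbf{v}_2 = \frac{\sqrt{3}}{\sqrt{2}} x$;

3. $\mathbf{v}_3 = x^2 - \langle x^2, \mathbf{w}_1 \rangle \mathbf{w}_1 - \langle x^2, \mathbf{w}_2 \rangle \mathbf{w}_2 = x^2 - \langle x^2, \mathbf{w}_1 \rangle \mathbf{w}_1 = x^2 - \frac{1}{3}$, $\langle \mathbf{v}_3, \mathbf{v}_3 \rangle = \int_{-1}^{1} \big( x^2 - \frac{1}{3} \big)^2 dx = \frac{8}{45}$, $\mathbf{w}_3 = \frac{1}{|\mathbf{v}_3|} \mathbf{v}_3 = \frac{3\sqrt{5}}{2\sqrt{2}} \big( x^2 - \frac{1}{3} \big)$.

By applying the Gram-Schmidt procedure to $\{1, x, x^2\}$ as above, we obtain the following orthonormal basis:

\[ \Big\{ \frac{1}{\sqrt{2}},\ \frac{\sqrt{3}}{\sqrt{2}} x,\ \frac{3\sqrt{5}}{2\sqrt{2}} \Big( x^2 - \frac{1}{3} \Big) \Big\}. \]

∎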
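The intermediate quantities of Example 4.9 can be reproduced exactly. The sketch below is a variant of the procedure, not the procedure as stated: to stay within exact fractions it keeps the unnormalized vectors $\mathbf{v}_i$ and divides each projection coefficient by $\langle \mathbf{v}_i, \mathbf{v}_i \rangle$ instead of normalizing with square roots. Polynomials are coefficient lists, and the helper names `inner` and `sub_scaled` are hypothetical.

```python
from fractions import Fraction

def inner(f, g):
    """Integral of f(x) * g(x) over [-1, 1]; f and g are coefficient lists."""
    total = Fraction(0)
    for i, a in enumerate(f):
        for j, b in enumerate(g):
            if (i + j) % 2 == 0:  # odd powers integrate to zero on [-1, 1]
                total += Fraction(a) * Fraction(b) * Fraction(2, i + j + 1)
    return total

def sub_scaled(f, c, g):
    """Return the polynomial f - c * g, padding the shorter list with zeros."""
    n = max(len(f), len(g))
    f = list(f) + [0] * (n - len(f))
    g = list(g) + [0] * (n - len(g))
    return [Fraction(a) - c * b for a, b in zip(f, g)]

monomials = [[1], [0, 1], [0, 0, 1]]   # 1, x, x^2
orthogonal = []                        # the unnormalized v_1, v_2, v_3
for a in monomials:
    v = [Fraction(c) for c in a]
    for u in orthogonal:
        # <a, u> / <u, u> equals <a, w> |w-direction| without taking square roots.
        v = sub_scaled(v, inner(a, u) / inner(u, u), u)
    orthogonal.append(v)

squared_norms = [inner(v, v) for v in orthogonal]
print(orthogonal)
print(squared_norms)
```

The squared norms $2$, $\frac{2}{3}$ and $\frac{8}{45}$ are the values of $\langle \mathbf{v}_i, \mathbf{v}_i \rangle$ in the example; dividing each $\mathbf{v}_i$ by the square root of its squared norm gives the orthonormal basis.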
4.4.2 Projection as Distance Minimization 
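Consider a $k$-dimensional subspace $\mathbb{W}$, spanned by an orthonormal basis $\{\mathbf{w}_1, \ldots, \mathbf{w}_k\}$, of a vector space $\mathbb{V}$. We can use for instance the Gram-Schmidt procedure above to find a basis of the full space $\mathbb{V}$ that includes the orthonormal basis of $\mathbb{W}$. Let $\{\mathbf{w}_1, \ldots, \mathbf{w}_k, \mathbf{v}_{k+1}, \ldots, \mathbf{v}_n\}$ be the complete orthonormal basis of $\mathbb{V}$. To project $\mathbf{u}$ onto $\mathbb{W}$, we first express $\mathbf{u}$ as follows, according to Fact 4.5:

\[ \mathbf{u} = \langle \mathbf{u}, \mathbf{w}_1 \rangle \mathbf{w}_1 + \cdots + \langle \mathbf{u}, \mathbf{w}_k \rangle \mathbf{w}_k + \langle \mathbf{u}, \mathbf{v}_{k+1} \rangle \mathbf{v}_{k+1} + \cdots + \langle \mathbf{u}, \mathbf{v}_n \rangle \mathbf{v}_n. \]

With an arbitrary vector $\mathbf{w} = x_1 \mathbf{w}_1 + \cdots + x_k \mathbf{w}_k \in \mathbb{W}$,

\[ \begin{aligned}
|\mathbf{u} - \mathbf{w}|^2 &= (\langle \mathbf{u}, \mathbf{w}_1 \rangle - x_1)^2 + \cdots + (\langle \mathbf{u}, \mathbf{w}_k \rangle - x_k)^2 + \langle \mathbf{u}, \mathbf{v}_{k+1} \rangle^2 + \cdots + \langle \mathbf{u}, \mathbf{v}_n \rangle^2 \\
&\geq \langle \mathbf{u}, \mathbf{v}_{k+1} \rangle^2 + \cdots + \langle \mathbf{u}, \mathbf{v}_n \rangle^2,
\end{aligned} \]

because

\[ \mathbf{u} - \mathbf{w} = (\langle \mathbf{u}, \mathbf{w}_1 \rangle - x_1) \mathbf{w}_1 + \cdots + (\langle \mathbf{u}, \mathbf{w}_k \rangle - x_k) \mathbf{w}_k + \langle \mathbf{u}, \mathbf{v}_{k+1} \rangle \mathbf{v}_{k+1} + \cdots + \langle \mathbf{u}, \mathbf{v}_n \rangle \mathbf{v}_n. \]

We get the minimal distance $|\mathbf{u} - \mathbf{w}|^2$ when $x_i = \langle \mathbf{u}, \mathbf{w}_i \rangle$ for all $i$. In other words, $\mathbf{w} \in \mathbb{W}$ that minimizes $|\mathbf{u} - \mathbf{w}|^2$ is

\[ \mathbf{w} = \langle \mathbf{u}, \mathbf{w}_1 \rangle \mathbf{w}_1 + \cdots + \langle \mathbf{u}, \mathbf{w}_k \rangle \mathbf{w}_k. \]

According to (4.6), this is precisely the projection of $\mathbf{u}$ onto $\mathbb{W}$, i.e., $P_{\mathbb{W}}(\mathbf{u})$.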
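The minimizing property can be probed on a small example. The sketch below is only a finite sanity check, not a proof: it compares the squared distance from $\mathbf{u}$ to its projection with the squared distance to every point $x_1 \mathbf{w}_1 + x_2 \mathbf{w}_2$ on a grid of coefficients. The orthonormal vectors, the grid, and the helper names are illustrative assumptions of ours.

```python
from fractions import Fraction as F
from itertools import product

def dot(x, y):
    """Standard inner product x^T y."""
    return sum(a * b for a, b in zip(x, y))

# Orthonormal basis of a 2-dimensional subspace W of R^3 (illustrative data).
w1 = [F(1, 3), F(2, 3), F(2, 3)]
w2 = [F(2, 3), F(1, 3), F(-2, 3)]
u = [F(1), F(1), F(1)]

def point_in_W(x1, x2):
    """Return x1 * w1 + x2 * w2."""
    return [x1 * a + x2 * b for a, b in zip(w1, w2)]

def dist_sq(p, q):
    """Squared Euclidean distance between p and q."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

# The projection uses the coefficients x_i = <u, w_i>.
projection = point_in_W(dot(u, w1), dot(u, w2))
best_dist = dist_sq(u, projection)

grid = [F(k, 4) for k in range(-8, 9)]  # coefficients from -2 to 2 in steps of 1/4
no_closer_point = all(dist_sq(u, point_in_W(x1, x2)) >= best_dist
                      for x1, x2 in product(grid, grid))
print(best_dist, no_closer_point)
```

For this data the squared distance to the projection is $\frac{1}{9}$, and no grid point in $\mathbb{W}$ comes closer.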


4.5 Decomposition into Orthogonal Complements 


Given a subspace, we will show that a vector in a finite-dimensional vector space can be expressed as a 
sum of a vector in the subspace and a vector in the orthogonal complement. Since these two subspaces 
are orthogonal to each other, we can conclude that any finite-dimensional vector space can be expressed 
as a direct sum of a subspace and its orthogonal complement. With respect to the given subspace, this 


decomposition is unique, which makes it particularly useful. 


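Fact 4.13 Let $\mathbb{V}$ be a finite-dimensional vector space and $\mathbb{W}$ be a subspace of $\mathbb{V}$. For any $\mathbf{v} \in \mathbb{V}$, there exist unique $\mathbf{w} \in \mathbb{W}$ and $\mathbf{z} \in \mathbb{W}^\perp$ such that $\mathbf{v} = \mathbf{w} + \mathbf{z}$. Therefore, the direct sum representation $\mathbb{V} = \mathbb{W} \oplus \mathbb{W}^\perp$ holds.

Proof: By Fact 4.11, we can obtain an orthonormal basis of $\mathbb{W}$. Then, for any $\mathbf{v} \in \mathbb{V}$, we get a projection $\mathbf{w} = P_{\mathbb{W}}(\mathbf{v})$ of $\mathbf{v}$ onto $\mathbb{W}$ through (4.6), and its orthogonal component $\mathbf{z} = \mathbf{v} - \mathbf{w}$ lies in $\mathbb{W}^\perp$. Then, the decompositions hold by Theorem 4.2. ∎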
The orthogonal complementation is applicable to any subspace. So a natural trial is $(\mathbb{W}^\perp)^\perp$, and we can guess that this double complementation results in the original subspace $\mathbb{W}$ by considering a direct sum example of $(\mathbb{R} \times \{0\}) \oplus (\{0\} \times \mathbb{R})$. In fact, for a finite-dimensional vector space, the orthogonal complement of the orthogonal complement of a subspace is the original subspace.⁶


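Fact 4.14 Let $\mathbb{V}$ be a finite-dimensional vector space and $\mathbb{W}$ be a subspace of $\mathbb{V}$. Then, $(\mathbb{W}^\perp)^\perp = \mathbb{W}$.

Proof: Let $\mathbf{w} \in \mathbb{W}$. For any $\mathbf{z} \in \mathbb{W}^\perp$, $\langle \mathbf{w}, \mathbf{z} \rangle = 0$ by the definition of orthogonal complement. Therefore, $\mathbf{w} \in (\mathbb{W}^\perp)^\perp$, that is, $\mathbb{W} \subset (\mathbb{W}^\perp)^\perp$.

Conversely, assume $\mathbf{v} \in (\mathbb{W}^\perp)^\perp$. By Fact 4.13, $\mathbf{v} = \mathbf{w} + \mathbf{z}$ for some $\mathbf{w} \in \mathbb{W}$ and $\mathbf{z} \in \mathbb{W}^\perp$, and $\langle \mathbf{w}, \mathbf{z} \rangle = 0$. Since $\mathbf{v} \in (\mathbb{W}^\perp)^\perp$, we know $\mathbf{v} \perp \mathbf{z}$. Then, $0 = \langle \mathbf{v}, \mathbf{z} \rangle = \langle \mathbf{w} + \mathbf{z}, \mathbf{z} \rangle = \langle \mathbf{w}, \mathbf{z} \rangle + \langle \mathbf{z}, \mathbf{z} \rangle = |\mathbf{z}|^2$. Hence, $\mathbf{z} = \mathbf{0}$ and $\mathbf{v} = \mathbf{w} \in \mathbb{W}$, which implies $(\mathbb{W}^\perp)^\perp \subset \mathbb{W}$. ∎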


⁶ This property does not necessarily hold in an infinite-dimensional vector space.


4.6 Orthogonality in Euclidean Spaces $\mathbb{R}^n$


In this section, we only consider a finite-dimensional Euclidean space and the standard inner product between two vectors

\[ \langle \mathbf{x}, \mathbf{y} \rangle = x_1 y_1 + \cdots + x_n y_n = \mathbf{x}^\top \mathbf{y}. \]

First, let us look at the orthogonality among four basic subspaces, starting from the definitions of the row and column spaces introduced in Sections 3.1.2 and 3.5.


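Fact 4.15 Let $A$ be an $m \times n$ matrix. The row space of $A$ is orthogonal to the nullspace (in $\mathbb{R}^n$). The column space of $A$ is orthogonal to the left nullspace (in $\mathbb{R}^m$).

Proof: Let $\mathbf{y} \in \operatorname{Null}(A^\top)$ and $\mathbf{b} \in \operatorname{Col}(A)$. There exists $\mathbf{x} \in \mathbb{R}^n$ such that $\mathbf{b} = A\mathbf{x}$. Since $\mathbf{y}^\top A = \mathbf{0}^\top$, $\mathbf{y}^\top \mathbf{b} = \mathbf{y}^\top (A\mathbf{x}) = (\mathbf{y}^\top A) \mathbf{x} = \mathbf{0}^\top \mathbf{x} = 0$. Therefore, $\mathbf{y} \perp \mathbf{b}$, that is, $\operatorname{Null}(A^\top) \perp \operatorname{Col}(A)$. Considering $A^\top$ instead of $A$, we can similarly show the orthogonality between the nullspace of $A$ and the row space of $A$. ∎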


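Because orthogonal subspaces are not necessarily the orthogonal complement of each other, the fact above tells us only that $\operatorname{Col}(A^\top) \perp \operatorname{Null}(A)$ but neither whether $\operatorname{Null}(A) = \operatorname{Col}(A^\top)^\perp$ nor whether $\operatorname{Null}(A)^\perp = \operatorname{Col}(A^\top)$.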


4.6.1 Orthogonal Complements of the Fundamental Subspaces 


We now consider the relationship between subspaces of an $m \times n$ matrix $A$ and their orthogonal complements.


• $\mathrm{Null}(A) = \mathrm{Col}(A^\top)^\perp$: Let $x \in \mathrm{Null}(A)$, that is, $Ax = 0$. Because the $i$-th element of $Ax$ is the inner product between the $i$-th row of $A$ and the vector $x$, $x$ is orthogonal to all row vectors of $A$. $x$ is thereby orthogonal to all vectors in the row space of $A$, i.e., $\mathrm{Null}(A) \subset \mathrm{Col}(A^\top)^\perp$.
On the other hand, assume $x \in \mathrm{Col}(A^\top)^\perp$. Since all rows of $A$ are included in $\mathrm{Col}(A^\top)$, the inner product between $x$ and any of the rows of $A$ is 0. In other words, $Ax = 0$, and hence $\mathrm{Col}(A^\top)^\perp \subset \mathrm{Null}(A)$.

• $\mathrm{Null}(A)^\perp = \mathrm{Col}(A^\top)$: We derive $\mathrm{Null}(A)^\perp = \mathrm{Col}(A^\top)$ from $\mathrm{Null}(A) = \mathrm{Col}(A^\top)^\perp$ and $(\mathrm{Col}(A^\top)^\perp)^\perp = \mathrm{Col}(A^\top)$ by Fact 4.14.


From these orthogonality results, we conclude that “The nullspace contains everything orthogonal to the row space” and that “The row space contains everything orthogonal to the nullspace”. If we swap $A$ and $A^\top$, we end up with “The left nullspace is the orthogonal complement of the column space”. We summarize these results as a lemma for reference.




Lemma 4.4 For any matrix $A$,
\[ \mathrm{Null}(A) = \mathrm{Col}(A^\top)^\perp \quad \text{and} \quad \mathrm{Null}(A)^\perp = \mathrm{Col}(A^\top). \]
This result is part of the fundamental theorem of linear algebra.
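As a small numerical illustration of Lemma 4.4, the following sketch checks that a nullspace vector is orthogonal to every row of a matrix, and hence to a vector in the row space. The matrix `A`, the vector `x`, the coefficients, and the helper `dot` are illustrative choices, not taken from the text.

```python
# Illustrative check of Null(A) ⟂ Col(A^T); all numbers are made-up examples.

def dot(x, y):
    """Standard inner product of two equal-length lists."""
    return sum(xi * yi for xi, yi in zip(x, y))

A = [[1.0, 2.0, 3.0],
     [0.0, 1.0, 1.0]]

# x solves Ax = 0: row 2 gives x2 = -x3, then row 1 gives x1 = -x3.
x = [-1.0, -1.0, 1.0]
Ax = [dot(row, x) for row in A]

# A row-space vector is a linear combination of the rows, so it is orthogonal to x too.
coeffs = [2.0, -5.0]
r = [coeffs[0] * A[0][j] + coeffs[1] * A[1][j] for j in range(3)]
inner = dot(r, x)
print(Ax, inner)
```

The check only covers one direction of the lemma (orthogonality); the complement statement additionally relies on the dimension argument in the text.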


In Fact 4.9, we identified a matrix corresponding to a projection transformation. The projection matrix $P$ in (4.7) is symmetric, and $P^2 = P$. Using Lemma 4.4, we can show that these two conditions fully characterize a projection matrix.

Fact 4.16 $P$ is a matrix representing an orthogonal projection onto a subspace of the Euclidean vector space $\mathbb{R}^n$ if and only if $P$ is symmetric and $P^2 = P$.


Proof: Let a transformation orthogonally project a vector onto a subspace $W$. Let $A$ be a matrix whose columns are linearly independent vectors spanning $W$, so that $W$ is the column space of $A$. Then, according to Fact 4.10, we can represent the projection onto $\mathrm{Col}(A)$ with
\[ P = A (A^\top A)^{-1} A^\top, \]
which is symmetric and satisfies $P^2 = P$.
Conversely, assume that $P$ is a symmetric matrix and that $P^2 = P$. Let $W = \mathrm{Col}(P)$. Then, for an arbitrary vector $v$,
\[ (v - Pv) \in \mathrm{Null}(P^\top) = \mathrm{Col}(P)^\perp = W^\perp \]
because
\[ P^\top (v - Pv) = P^\top (I - P) v = P (I - P) v = (P - P^2) v = 0. \]
That is, $(v - Pv) \perp W$ and $Pv \in W$, which means that the operator corresponding to $P$ is the orthogonal projection onto the subspace $W = \mathrm{Col}(P)$. □
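The two conditions of Fact 4.16 can be checked numerically in the simplest case, where $A$ has a single column $a$ and $A(A^\top A)^{-1}A^\top$ reduces to $a a^\top / (a^\top a)$. This is a minimal sketch: the vectors `a` and `v` and the helper `matmul` are illustrative assumptions.

```python
# Sketch: projection onto the line spanned by one vector a (A has a single column),
# so P = A (A^T A)^{-1} A^T reduces to a a^T / (a^T a). The vectors are arbitrary choices.

def matmul(X, Y):
    """Product of two matrices stored as lists of rows."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

a = [1.0, 2.0, 2.0]
aTa = sum(ai * ai for ai in a)                     # a^T a = 9
P = [[ai * aj / aTa for aj in a] for ai in a]      # a a^T / (a^T a)

P2 = matmul(P, P)
symmetric = all(abs(P[i][j] - P[j][i]) < 1e-12 for i in range(3) for j in range(3))
idempotent = all(abs(P2[i][j] - P[i][j]) < 1e-12 for i in range(3) for j in range(3))

# The residual v - Pv should be orthogonal to a, as in the converse part of the proof.
v = [3.0, -1.0, 4.0]
Pv = [sum(P[i][j] * v[j] for j in range(3)) for i in range(3)]
residual_dot_a = sum((v[i] - Pv[i]) * a[i] for i in range(3))
print(symmetric, idempotent, residual_dot_a)
```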


4.7 Orthogonal Matrices 


Let the columns of an $m \times n$ matrix $Q$ be orthonormal. Because orthonormal vectors are linearly independent, $m \ge n$. The $(i, j)$-th element of $Q^\top Q$ is the inner product between the column vectors $q_i$ and $q_j$, which is 1 if $i = j$ and 0 otherwise ($i \neq j$). That is, $Q^\top Q = I_n$ and $Q$ has a left inverse. When $m = n$, the square matrix $Q$ has linearly independent columns, so $Q$ is invertible with $Q^{-1} = Q^\top$, and we call $Q$ an orthogonal matrix. Since $Q^\top Q = Q Q^\top = I$ in this case, both the columns and the rows of $Q$ are orthonormal.




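Example 4.10 Both the rotation matrix in $\mathbb{R}^2$, $R = \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix}$, and the permutation matrix in $\mathbb{R}^3$, $Q = \begin{bmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \\ 1 & 0 & 0 \end{bmatrix}$, are orthogonal matrices since
\[ R^\top R = \begin{bmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{bmatrix} \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix} = \begin{bmatrix} \cos^2\theta + \sin^2\theta & 0 \\ 0 & \sin^2\theta + \cos^2\theta \end{bmatrix} = I \]
and
\[ Q^\top Q = \begin{bmatrix} 0 & 0 & 1 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \end{bmatrix} \begin{bmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \\ 1 & 0 & 0 \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} = I. \]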


In fact, (B.3) says that $QQ^\top = I$ for an $n \times n$ permutation matrix $Q$. Therefore any permutation matrix is an orthogonal matrix.
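The two matrices of Example 4.10 can also be checked numerically. The sketch below assumes matrices stored as lists of rows; the angle `theta` and the helper names are illustrative, not from the text.

```python
import math

def transpose(M):
    return [list(col) for col in zip(*M)]

def matmul(X, Y):
    """Product of two matrices stored as lists of rows."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def is_identity(M, tol=1e-12):
    return all(abs(M[i][j] - (1.0 if i == j else 0.0)) < tol
               for i in range(len(M)) for j in range(len(M[0])))

theta = 0.7  # arbitrary angle chosen for illustration
R = [[math.cos(theta), -math.sin(theta)],
     [math.sin(theta),  math.cos(theta)]]
Q = [[0, 1, 0],
     [0, 0, 1],
     [1, 0, 0]]

rotation_ok = is_identity(matmul(transpose(R), R))
permutation_ok = (is_identity(matmul(transpose(Q), Q))
                  and is_identity(matmul(Q, transpose(Q))))
print(rotation_ok, permutation_ok)
```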


4.7.1 QR Decomposition 


Consider an $m \times n$ matrix $A = [a_1 \mid a_2 \mid \cdots \mid a_n]$ whose columns are linearly independent. Because the columns are linearly independent, $m \ge n$. Furthermore, let $\{q_1, \ldots, q_n\}$ be an orthonormal basis obtained from the linearly independent column vectors using the Gram-Schmidt procedure from Section 4.4.1. Say $Q = [q_1 \mid q_2 \mid \cdots \mid q_n]$. According to Fact 4.12, each column $a_j$ is in $\mathrm{span}\{q_1, \ldots, q_j\}$ for all $j = 1, \ldots, n$, and can be expressed as
\[ a_j = \langle a_j, q_1 \rangle q_1 + \cdots + \langle a_j, q_j \rangle q_j. \]
These equations can be collectively rewritten as
\[ A = [a_1 \mid a_2 \mid \cdots \mid a_n] = [q_1 \mid q_2 \mid \cdots \mid q_n] \begin{bmatrix} \langle a_1, q_1 \rangle & \langle a_2, q_1 \rangle & \langle a_3, q_1 \rangle & \cdots & \langle a_n, q_1 \rangle \\ 0 & \langle a_2, q_2 \rangle & \langle a_3, q_2 \rangle & \cdots & \langle a_n, q_2 \rangle \\ 0 & 0 & \langle a_3, q_3 \rangle & \cdots & \langle a_n, q_3 \rangle \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & \langle a_n, q_n \rangle \end{bmatrix}. \]
Let us denote the last upper triangular matrix as $R$. Then, we obtain the so-called QR-decomposition of $A$ as
\[ A = QR. \]
$Q$ and $R$ are a matrix with orthonormal columns and an upper triangular matrix, respectively. It also holds that $\langle a_j, q_j \rangle \neq 0$ because of (4.11), and hence $R$ has non-zero diagonal entries. Therefore, $R$ is invertible, as it is upper triangular and does not have any zero on its diagonal. If $m = n$, $Q$ is further an orthogonal matrix.




When there are negative elements on the diagonal of $R$, we can create a diagonal matrix $D$ such that $d_{jj} = -1$ if $\langle a_j, q_j \rangle < 0$ and $d_{jj} = 1$ otherwise. In this case, the following three properties hold: $D^2 = I$, all diagonal entries of $DR$ are positive, and the columns of $QD$ continue to be orthonormal. We can therefore assume that all the diagonal elements of the upper-triangular matrix are positive by decomposing $A$ into $A = QR = (QD)(DR)$.
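The construction above can be sketched in a few lines. This is a minimal illustration using classical Gram-Schmidt on plain Python lists, assuming linearly independent columns; the function name `gram_schmidt_qr` and the example matrix are hypothetical, and the sketch is not meant as a numerically robust QR routine.

```python
import math

def gram_schmidt_qr(A):
    """Return (Q, R) with A = QR for a matrix A (list of rows) with independent columns."""
    m, n = len(A), len(A[0])
    cols = [[A[i][j] for i in range(m)] for j in range(n)]    # columns a_1, ..., a_n
    qs = []                                                   # orthonormal q_1, ..., q_n
    R = [[0.0] * n for _ in range(n)]
    for j, a in enumerate(cols):
        w = a[:]
        for i, q in enumerate(qs):
            R[i][j] = sum(ak * qk for ak, qk in zip(a, q))    # <a_j, q_i>
            w = [wk - R[i][j] * qk for wk, qk in zip(w, q)]   # remove the q_i component
        norm = math.sqrt(sum(wk * wk for wk in w))
        R[j][j] = norm                                        # equals <a_j, q_j>, positive here
        qs.append([wk / norm for wk in w])
    Q = [[qs[j][i] for j in range(n)] for i in range(m)]      # q_j become the columns of Q
    return Q, R

A = [[1.0, 1.0],
     [1.0, 0.0],
     [0.0, 1.0]]
Q, R = gram_schmidt_qr(A)
QR = [[sum(Q[i][k] * R[k][j] for k in range(2)) for j in range(2)] for i in range(3)]
QtQ = [[sum(Q[k][i] * Q[k][j] for k in range(3)) for j in range(2)] for i in range(2)]
print(R)
```

Because each $q_j$ is obtained by normalizing the remainder of $a_j$, this particular construction already yields positive diagonal entries, so no sign matrix $D$ is needed for it.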


4.7.2 Isometry induced by an Orthogonal Matrix 


Here we consider a linear transformation $T : x \in \mathbb{R}^n \mapsto T(x) = Qx \in \mathbb{R}^n$ with an orthogonal matrix $Q$. With $\langle \cdot, \cdot \rangle$ as the standard inner product in $\mathbb{R}^n$,
\[ \langle T(x), T(y) \rangle = \langle Qx, Qy \rangle = (Qx)^\top Qy = x^\top Q^\top Q y = x^\top y = \langle x, y \rangle. \]
That is, for any pair of vectors $x, y \in \mathbb{R}^n$,
\[ \langle T(x), T(y) \rangle = \langle x, y \rangle. \tag{4.13} \]
In other words, the standard inner product is preserved under an orthogonal transformation. In the case of $x = y$, the norm induced by the standard inner product is also preserved, as
\[ |T(x)|^2 = \langle T(x), T(x) \rangle = \langle x, x \rangle = |x|^2. \]
We call such a transformation that preserves the norm an isometry.
When a linear transformation $\ell$ preserves the norm induced by an inner product, i.e., $|\ell(x)| = |x|$, the inner product is also preserved, i.e., $\langle x, y \rangle = \langle \ell(x), \ell(y) \rangle$, because
\begin{align*}
|x|^2 + 2\langle x, y \rangle + |y|^2 &= |x + y|^2 \\
&= |\ell(x + y)|^2 \\
&= |\ell(x) + \ell(y)|^2 \\
&= |\ell(x)|^2 + 2\langle \ell(x), \ell(y) \rangle + |\ell(y)|^2 \\
&= |x|^2 + 2\langle \ell(x), \ell(y) \rangle + |y|^2.
\end{align*}
In other words, preserving an inner product and preserving the norm induced by the inner product are equivalent under linear transformation.
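A short numerical sketch of both statements, using a rotation as the orthogonal matrix. The angle and the vectors `x`, `y` are arbitrary illustrative choices, and `apply` is a hypothetical helper for the matrix-vector product.

```python
import math

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def apply(M, x):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [dot(row, x) for row in M]

theta = 1.1  # arbitrary rotation angle for illustration
Q = [[math.cos(theta), -math.sin(theta)],
     [math.sin(theta),  math.cos(theta)]]
x, y = [2.0, -1.0], [0.5, 3.0]
Qx, Qy = apply(Q, x), apply(Q, y)

inner_before = dot(x, y)       # <x, y>
inner_after = dot(Qx, Qy)      # <Qx, Qy>, expected to match (4.13)

# Recover the inner product of the images from norms only, as in the derivation:
# <l(x), l(y)> = (|l(x) + l(y)|^2 - |l(x)|^2 - |l(y)|^2) / 2.
s = [a + b for a, b in zip(Qx, Qy)]
inner_from_norms = (dot(s, s) - dot(Qx, Qx) - dot(Qy, Qy)) / 2
print(inner_before, inner_after, inner_from_norms)
```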


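Among the linear transformations from Section 3.8.2, the following ones are isometries.
• Rotation: it is expected to be an isometry as a rotation preserves the norm and inner product. Indeed, in $\mathbb{R}^2$, $R_\theta$ is orthogonal, because $R_\theta^\top R_\theta = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$ with $R_\theta = \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix} = \begin{bmatrix} c & -s \\ s & c \end{bmatrix}$.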




• Reflection: We can expect a reflection matrix to be orthogonal, since reflection preserves an inner product and the associated norm. Indeed, $H^\top = H^{-1}$, because $H^\top = H$ and $H^2 = I$ for a reflection matrix $H$.


4.8 Matrix Norms 


When working with matrices, we often want to quantify their sizes as transformations. We call such a measure of size a norm. For instance, we would use the norm of the difference of two matrices in order to quantify the similarity between them. There are many different ways to define a matrix norm, and here, we define and largely stick to the Frobenius and spectral norms. In order to introduce the matrix norms, we first introduce the trace of an $n \times n$ matrix $A = (a_{ij})$ as
\[ \mathrm{trace}\, A = a_{11} + \cdots + a_{nn}. \]


• Frobenius norm: We naively extend the Euclidean vector norm induced by the standard inner product to an $n \times d$ matrix by⁷
\[ \|A\|_F = \sqrt{\sum_{i=1}^{n} \sum_{j=1}^{d} a_{ij}^2} = \sqrt{\mathrm{trace}(AA^\top)} = \sqrt{\mathrm{trace}(A^\top A)}. \]
Based on this definition, we can see that $\|A\|_F = \|A^\top\|_F$. For a matrix $V$ with orthonormal columns (that is, $V^\top V = I$) and a matrix $A$, the following holds:
\[ \|AV^\top\|_F^2 = \mathrm{trace}(AV^\top (AV^\top)^\top) = \mathrm{trace}(AV^\top V A^\top) = \mathrm{trace}(AA^\top) = \|A\|_F^2. \]
Similarly, for a matrix $U$ with orthonormal columns (that is, $U^\top U = I$), $\|UA\|_F = \|A\|_F$.


• Spectral norm: This matrix norm measures how much unit vectors can be stretched by the linear transformation defined by the matrix. That is, we define the norm of a matrix $A$ by the following maximization problem
\[ \max_{|x| \le 1} |Ax|. \]
It is not apparent whether an optimal $x^*$ exists and whether we can find such $x^*$ in this optimization problem. The existence of such $x^*$, and the fact that its norm is 1, are shown in Lemma C.3. This result simplifies the optimization problem above and leads to the following definition of the spectral norm:
\[ \|A\|_2 = \max_{|x| \le 1} |Ax| = \max_{|x| = 1} |Ax|. \tag{4.14} \]
As examples, the spectral norms of rotation and reflection are both 1 since they induce isometries.
Because $\left| A \left( \frac{x}{|x|} \right) \right| \le \|A\|_2$ for $x \neq 0$, it holds for an arbitrary vector $x \in \mathbb{R}^d$ that
\[ |Ax| \le \|A\|_2 |x|. \tag{4.15} \]

⁷ If $AB$ and $BA$ are both defined for a pair of matrices $A$ and $B$, their traces match, that is, $\mathrm{trace}(AB) = \mathrm{trace}(BA)$.




Here are some interesting properties of matrix norms:
Fact 4.17 For an $n \times d$ matrix $A$ and a $d \times m$ matrix $B$,

1. $\|AB\|_2 \le \|A\|_2 \|B\|_2$;

2. $\|AB\|_F \le \|A\|_F \|B\|_F$;

3. $\|A\|_2 \le \|A\|_F$;

4. For an orthogonal matrix $Q$, $\|QA\|_2 = \|A\|_2$ and $\|QA\|_F = \|A\|_F$;

5. If $A = u^\top$ or $A = u$ for an $n$-dimensional vector $u$, $\|A\|_2 = \|A\|_F = |u|$.
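Proof:

1. By (4.15), $|ABx| \le \|A\|_2 |Bx| \le \|A\|_2 \|B\|_2 |x|$ for any $x$.

2. Let $a_{i\bullet} = (a_{i1}, \ldots, a_{id})^\top$ and $b_{\bullet j} = (b_{1j}, \ldots, b_{dj})^\top$. Then,
\[ \|AB\|_F^2 = \sum_i \sum_j (a_{i\bullet}^\top b_{\bullet j})^2 \le \sum_i \sum_j |a_{i\bullet}|^2 |b_{\bullet j}|^2 = \sum_i |a_{i\bullet}|^2 \sum_j |b_{\bullet j}|^2 = \|A\|_F^2 \|B\|_F^2. \]

3. For $|x| \le 1$, $|Ax|^2 = \sum_{i=1}^n (a_{i\bullet}^\top x)^2 \le \sum_{i=1}^n |a_{i\bullet}|^2 |x|^2 \le \sum_{i=1}^n |a_{i\bullet}|^2 = \|A\|_F^2$.

4. $|QAx| = \sqrt{x^\top A^\top Q^\top Q A x} = \sqrt{x^\top A^\top A x} = |Ax|$, and $\|QA\|_F^2 = \mathrm{trace}(A^\top Q^\top Q A) = \mathrm{trace}(A^\top A) = \|A\|_F^2$.

5. It is clear that $\|A\|_F = |u|$ from the definition of the Frobenius norm. By the Cauchy-Schwarz inequality, $|u^\top x| \le |u|$ for any $n$-dimensional vector $x$ with $|x| = 1$. If we let $x = \frac{u}{|u|}$, then $Ax = u^\top x = \frac{1}{|u|} u^\top u = |u|$. Hence, $\|A\|_2 = |u|$. A similar derivation works for $A = u$.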
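The first three items of Fact 4.17 can be checked numerically for small matrices. The sketch below approximates the spectral norm of a matrix with two columns by a grid search over unit vectors, which is a direct but crude reading of (4.14) and not an efficient algorithm; the matrices `A`, `B` and the function names are illustrative assumptions.

```python
import math

def frobenius(M):
    return math.sqrt(sum(v * v for row in M for v in row))

def spectral_2d(M, steps=20000):
    """Approximate max |Mx| over unit vectors x in R^2 by scanning angles (grid search)."""
    best = 0.0
    for s in range(steps):
        t = math.pi * s / steps       # half a circle suffices since |M(-x)| = |Mx|
        x = (math.cos(t), math.sin(t))
        best = max(best, math.sqrt(sum((row[0] * x[0] + row[1] * x[1]) ** 2 for row in M)))
    return best

def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

A = [[2.0, 1.0], [0.0, 1.0]]
B = [[1.0, -1.0], [2.0, 0.5]]
AB = matmul(A, B)

spec_A, spec_B, spec_AB = spectral_2d(A), spectral_2d(B), spectral_2d(AB)
fro_A, fro_B, fro_AB = frobenius(A), frobenius(B), frobenius(AB)
print(spec_A, fro_A)
```

For this `A`, the largest eigenvalue of $A^\top A$ is $3 + \sqrt{5}$, so the grid value of `spec_A` should be close to $\sqrt{3 + \sqrt{5}}$, slightly from below.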


Fact 4.18 Every $m \times n$ rank-one matrix can be represented as $uv^\top$ for some $u \in \mathbb{R}^m$ and $v \in \mathbb{R}^n$. An upper bound on both the Frobenius and spectral norms of a rank-one matrix $uv^\top$ is $|u||v|$.
Proof: For any $n$-vector $x$,
\begin{align*}
|uv^\top x| &= |u(v^\top x)| \\
&\le \|u\| \, |v^\top x| && \text{(by regarding $u$ as an $m \times 1$ matrix)} \\
&\le \|u\| \, \|v^\top\| \, |x| && \text{(by regarding $v^\top$ as a $1 \times n$ matrix)} \\
&= |u||v||x| && \text{(by 5 of Fact 4.17)},
\end{align*}
regardless of the type of the norm $\|\cdot\|$. □




4.9 Application to Data Science: Least Square and Projection 


Both in natural sciences and engineering, there are many cases in which data is expressed as the relationship between an explanatory/feature vector $z = (z_1, \ldots, z_k)^\top \in \mathbb{R}^k$ and a dependent variable $y$. Consider a case where such a relationship is in the form of
\[ y = \theta_1 f_1(z) + \cdots + \theta_n f_n(z), \]
with appropriate functions $f_1, \ldots, f_n : \mathbb{R}^k \to \mathbb{R}$ and constants $\theta_j$. We assume each $f_i$ is known and can be computed exactly. Then, instead of $z$ from $(z, y) \in \mathbb{R}^k \times \mathbb{R}$, we can use $x = (x_1, \ldots, x_n)^\top = (f_1(z), \ldots, f_n(z))^\top \in \mathbb{R}^n$ and assume the following linear model of relationship:
\[ y = \theta_1 x_1 + \cdots + \theta_n x_n = \theta^\top x, \]
where $\theta = (\theta_1, \ldots, \theta_n)^\top$. In reality, however, there often exists measurement error $\varepsilon_i$ for each data pair $(a_i, b_i)$ for $i = 1, \ldots, m$, and we assume that such measurement error is additive:
\[ b_i = \theta^\top a_i + \varepsilon_i, \quad i = 1, \ldots, m. \]
If we for now ignore measurement noise and express this linear relationship in terms of an $m \times n$ data matrix $A$, whose $i$-th row is $a_i^\top$, and a vector $b = (b_1, \ldots, b_m)^\top$, we obtain the following linear system:
\[ A\theta = b. \]
The problem is then to find $\theta$ that satisfies this linear system, although there may not be any $\theta$ that satisfies $b_i = \theta^\top a_i$ for all $i$ due to measurement noise $\varepsilon_i$. That is, it may be that $b \notin \mathrm{Col}(A)$.
Alternatively, we can approach this problem by finding $\theta$ that minimizes the sum of squares of the errors $\varepsilon_i = \theta^\top a_i - b_i$. That is,⁸
\[ \hat{\theta} = \operatorname*{argmin}_{\theta \in \mathbb{R}^n} \sum_{i=1}^{m} \varepsilon_i^2 = \operatorname*{argmin}_{\theta \in \mathbb{R}^n} \sum_{i=1}^{m} (\theta^\top a_i - b_i)^2 = \operatorname*{argmin}_{\theta \in \mathbb{R}^n} |A\theta - b|^2. \tag{4.16} \]


4.9.1 Least Square as a Convex Quadratic Minimization 


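Let us expand the objective function $|A\theta - b|^2$ above:
\begin{align*}
|A\theta - b|^2 &= \langle A\theta - b, A\theta - b \rangle \\
&= \langle A\theta, A\theta \rangle - \langle A\theta, b \rangle - \langle b, A\theta \rangle + \langle b, b \rangle \\
&= (A\theta)^\top A\theta - 2b^\top A\theta + |b|^2 \\
&= \theta^\top A^\top A\theta - 2b^\top A\theta + |b|^2.
\end{align*}

⁸ We introduce a new notation argmin here. For a given function $f$, $\operatorname*{argmin}_{x \in \mathcal{A}} f(x)$ refers to an element in $\mathcal{A}$ that minimizes $f$. If it is clear from the context, such as when $\mathcal{A} = \mathbb{R}^n$, we often omit $\in \mathcal{A}$. argmax is defined similarly.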




Because $|A\theta - b|^2$ is convex with respect to $\theta$ due to Theorem A.1, the minimum is attained at the point where the gradient is zero. We thus compute the gradient with respect to $\theta$, following Fact 4.19, and obtain
\[ 2A^\top A\theta - 2A^\top b = 0 \iff A^\top A\theta = A^\top b, \]
to which we refer as the normal equation.

Often, the number $m$ of data points is significantly greater than the number $n$ of parameters ($m \gg n$), and in that case the rank of $A$ is typically $n$. Even if the rank of $A$ is less than $n$, there is no issue in assuming that the rank is $n$, since we can always reduce the number of parameters by exploiting the linear dependence until the columns are linearly independent. If the rank of $A$ is $n$, the rank of $A^\top A$ is also $n$ due to Fact 3.8, and therefore the normal equation admits the following solution:
\[ \hat{\theta} = (A^\top A)^{-1} A^\top b. \]
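As a concrete instance, the sketch below fits a line $b \approx \theta_1 + \theta_2 z$ by forming and solving the normal equation. The data values, the feature choice $f_1(z) = 1$, $f_2(z) = z$, and the helper `solve_2x2` are illustrative assumptions, and the explicit $2 \times 2$ inverse is used only because $n = 2$ here.

```python
def solve_2x2(M, r):
    """Solve M t = r for an invertible 2x2 matrix M via the explicit inverse formula."""
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    return [(M[1][1] * r[0] - M[0][1] * r[1]) / det,
            (-M[1][0] * r[0] + M[0][0] * r[1]) / det]

# Made-up noisy observations of roughly b = 1 + 2 z.
z = [0.0, 1.0, 2.0, 3.0]
b = [1.1, 2.9, 5.2, 6.8]
A = [[1.0, zi] for zi in z]          # i-th row is a_i^T = (f_1(z_i), f_2(z_i)) = (1, z_i)

# Normal equation: (A^T A) theta = A^T b.
AtA = [[sum(row[i] * row[j] for row in A) for j in range(2)] for i in range(2)]
Atb = [sum(row[i] * bi for row, bi in zip(A, b)) for i in range(2)]
theta_hat = solve_2x2(AtA, Atb)
print(theta_hat)
```

Here $A^\top A = \begin{bmatrix} 4 & 6 \\ 6 & 14 \end{bmatrix}$ and $A^\top b = (16, 33.7)^\top$, which gives $\hat{\theta} = (1.09, 1.94)^\top$ up to rounding.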


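Fact 4.19 Let $Q$ be an $n \times n$ matrix, $b$ an $n$-vector, and $c$ a real number. A real-valued quadratic function $f : \mathbb{R}^n \to \mathbb{R}$ is defined as
\[ f(x) = x^\top Q x + b^\top x + c. \]
Then, the gradient of $f$ is given as
\[ \nabla f(x) = Qx + Q^\top x + b. \]
If the matrix $Q$ is symmetric, then $\nabla f(x) = 2Qx + b$.
Proof: Since $f(x)$ can be written as
\[ f(x_1, \ldots, x_n) = \sum_{i=1}^{n} \sum_{j=1}^{n} q_{ij} x_i x_j + \sum_{i=1}^{n} b_i x_i + c = q_{kk} x_k^2 + \sum_{i \neq k} \sum_{j=1}^{n} q_{ij} x_i x_j + \sum_{j \neq k} q_{kj} x_k x_j + \sum_{i=1}^{n} b_i x_i + c, \]
its partial derivative with respect to $x_k$ is
\[ \frac{\partial f}{\partial x_k}(x_1, \ldots, x_n) = 2 q_{kk} x_k + \sum_{i \neq k} q_{ik} x_i + \sum_{j \neq k} q_{kj} x_j + b_k = \sum_{i=1}^{n} q_{ik} x_i + \sum_{j=1}^{n} q_{kj} x_j + b_k = (Q^\top x)_k + (Qx)_k + b_k, \]
which shows the desired representation. For a symmetric $Q$, the conclusion is straightforward. □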
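The gradient formula of Fact 4.19 can be sanity-checked against central finite differences. In this sketch the matrix `Q` is deliberately non-symmetric so that both terms $Qx$ and $Q^\top x$ matter; all numbers and function names are illustrative assumptions.

```python
def f(Q, b, c, x):
    """Quadratic function x^T Q x + b^T x + c."""
    n = len(x)
    return (sum(Q[i][j] * x[i] * x[j] for i in range(n) for j in range(n))
            + sum(b[i] * x[i] for i in range(n)) + c)

def grad(Q, b, x):
    """Gradient Qx + Q^T x + b from Fact 4.19."""
    n = len(x)
    return [sum(Q[k][j] * x[j] for j in range(n))
            + sum(Q[i][k] * x[i] for i in range(n)) + b[k] for k in range(n)]

def numeric_grad(Q, b, c, x, h=1e-6):
    """Central finite differences, used only as an independent check."""
    g = []
    for k in range(len(x)):
        xp, xm = x[:], x[:]
        xp[k] += h
        xm[k] -= h
        g.append((f(Q, b, c, xp) - f(Q, b, c, xm)) / (2 * h))
    return g

Q = [[1.0, 2.0], [0.0, 3.0]]   # deliberately non-symmetric
b = [-1.0, 0.5]
c = 4.0
x = [0.3, -0.7]
g_formula = grad(Q, b, x)
g_numeric = numeric_grad(Q, b, c, x)
print(g_formula, g_numeric)
```

For these values, $Qx = (-1.1, -2.1)^\top$ and $Q^\top x = (0.3, -1.5)^\top$, so the formula gives $(-1.8, -3.1)^\top$.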


4.9.2 Equivalence between Least Square and Projection 


Because $\hat{\theta}$ is the solution to the optimization problem (4.16), we can use $A\hat{\theta} = A(A^\top A)^{-1}A^\top b$ as an approximation to $b$. We have seen earlier that the projection of a vector onto a subspace is the nearest vector in the subspace to the original vector. We then want to check whether the error vector $b - A\hat{\theta}$ is also orthogonal to the column space of $A$, $\mathrm{Col}(A)$, since $A\hat{\theta}$ is the best approximation to $b$. According to Lemma 4.4, we need to check whether $b - A\hat{\theta} \in \mathrm{Null}(A^\top)$ holds, as $\mathrm{Col}(A)^\perp = \mathrm{Null}(A^\top)$. By multiplying $b - A\hat{\theta}$ with $A^\top$ from the left,
\[ A^\top (b - A\hat{\theta}) = A^\top (I - A(A^\top A)^{-1}A^\top) b = (A^\top - A^\top A(A^\top A)^{-1}A^\top) b = 0, \]
implying that this error vector is indeed orthogonal to the column space. In summary, the vector in $\mathrm{Col}(A)$ that is closest to an arbitrary vector is its projection, and finding this nearest vector corresponds to multiplying by the following matrix
\[ A(A^\top A)^{-1}A^\top, \]
which is identical to the projection matrix (4.9). Therefore, solutions to least square problems and projections are equivalent.
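The orthogonality of the error vector can be observed on a small example. This sketch uses made-up data for a line fit (an assumption for illustration), solves the normal equation with the $2 \times 2$ inverse formula, and then checks that $A^\top(b - A\hat{\theta})$ vanishes and that nearby parameter values do not achieve a smaller squared error.

```python
z = [0.0, 1.0, 2.0, 3.0]
b = [1.1, 2.9, 5.2, 6.8]
A = [[1.0, zi] for zi in z]           # rows (1, z_i); illustrative data matrix

# Solve the normal equation (A^T A) theta = A^T b with the 2x2 inverse formula.
AtA = [[sum(r[i] * r[j] for r in A) for j in range(2)] for i in range(2)]
Atb = [sum(r[i] * bi for r, bi in zip(A, b)) for i in range(2)]
det = AtA[0][0] * AtA[1][1] - AtA[0][1] * AtA[1][0]
theta_hat = [(AtA[1][1] * Atb[0] - AtA[0][1] * Atb[1]) / det,
             (AtA[0][0] * Atb[1] - AtA[1][0] * Atb[0]) / det]

fitted = [r[0] * theta_hat[0] + r[1] * theta_hat[1] for r in A]    # A theta_hat
residual = [bi - fi for bi, fi in zip(b, fitted)]                  # b - A theta_hat
At_residual = [sum(r[i] * e for r, e in zip(A, residual)) for i in range(2)]

def sq_error(theta):
    """Objective |A theta - b|^2 from (4.16)."""
    return sum((r[0] * theta[0] + r[1] * theta[1] - bi) ** 2 for r, bi in zip(A, b))

best = sq_error(theta_hat)
perturbed = [sq_error([theta_hat[0] + d0, theta_hat[1] + d1])
             for d0 in (-0.1, 0.0, 0.1) for d1 in (-0.1, 0.0, 0.1)]
print(At_residual, best)
```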




Chapter 5 


Singular Value Decomposition (SVD) 


We often want to represent a collection of high-dimensional vectors in a lower-dimensional vector space, for instance for the purpose of visualization. We can approach this problem as orthogonal projection, and a key question is onto which subspace we want to project these high-dimensional vectors, often data points. It turns out that the answer to this question depends on how we measure the error arising from this subspace projection. A natural choice of an error measure is the sum of squared norms of residual vectors connecting the original high-dimensional vectors and their projections onto the subspace. With this choice, we can formulate the problem of finding the optimal subspace as that of searching for an orthonormal basis of the subspace onto which orthogonal projection minimizes this error measure. We will show that we can identify this optimal subspace by sequentially solving one-dimensional optimization problems to find one orthonormal basis vector at a time, while satisfying increasingly more orthonormality constraints. We call these orthonormal basis vectors right singular vectors, and the optimal value attained in each one-dimensional problem defines a singular value. Together with left singular vectors, we arrive at singular value decomposition (SVD). The right singular vectors are the optimal low-dimensional representations of the original higher-dimensional (data) vectors, assuming that we performed SVD on a (non-square) data matrix of data rows.


Due to our design of incremental optimizations searching for right singular vectors, the singular value obtained at an earlier stage is at least as large as the one at a later stage. The sum of the rank-one matrices built from the first $k$ singular values and singular vectors turns out to be the best rank-$k$ approximation of the data matrix. Furthermore, we can approximately invert any given data matrix, regardless of its invertibility, using singular vectors, which leads us to define the pseudoinverse. This pseudoinverse projects data vectors onto a subspace spanned by linearly dependent vectors, corresponding to the least square solution, when the rank of the data matrix is not full. By translating this optimization formulation into the language of statistics, we arrive at principal component analysis (PCA), with the right singular vectors corresponding to the principal components in PCA.


Kang and Cho, Linear Algebra for Data Science, 89 
©2024. (Wanmo Kang, Kyunghyun Cho) all rights reserved. 




5.1 A Variational Formulation for the Best-fit Subspaces 


Let $A$ be an $n \times d$ matrix whose rows correspond to $d$-dimensional (data) feature vectors. If $a_i \in \mathbb{R}^d$ is the $i$-th row of $A$, then $A$ encodes the same information as $\{a_i : i = 1, \ldots, n\}$, which is a set of $n$ elements. Here, we consider the problem of finding a $k$-dimensional subspace $W$ that represents the data set $\{a_i : i = 1, \ldots, n\}$ well, assuming $k < d$. Among many possible criteria to measure how well such a subspace represents data, we use the square of a vector norm with singular value decomposition (SVD). That is, we find the optimal $k$-dimensional subspace $W$ to minimize the “sum of squares of residuals”, as in
\[ W^* = \operatorname*{argmin}_{W : \dim(W) = k} \sum_{i=1}^{n} |a_i - P_W(a_i)|^2. \tag{5.1} \]
Throughout this chapter, we use the standard inner product in $\mathbb{R}^d$. Up until now, we were mainly concerned about how to compute the projection, but here we are more concerned about how to determine the direction of projection.


Before going further, there are two questions that arise naturally here:

1. Why do we use the sum of squares? From the perspective of machine learning, the sum of squares is a special case of a decomposable loss function, $\sum_i \ell(a_i, P_W(a_i))$. With the sum of squares, we can use the Pythagorean theorem to simplify it dramatically. Because
\[ |a_i|^2 = |P_W(a_i)|^2 + |a_i - P_W(a_i)|^2 \]
and $\sum_{i=1}^{n} |a_i|^2$ is constant, we can rewrite (5.1) as
\[ W^* = \operatorname*{argmax}_{W : \dim(W) = k} \sum_{i=1}^{n} |P_W(a_i)|^2. \tag{5.2} \]
We use this form later when we derive SVD. Of course, this does not prevent us from using another loss function, but it must be determined for each loss function whether we can derive a simple solution.

2. How do we express a $k$-dimensional subspace? Although we can use $k$ arbitrary but linearly independent vectors to do so, we use $k$ orthonormal vectors, called right-singular vectors, instead for the simplicity of orthogonal projection.


5.1.1 Best-fit 1-dimensional subspace 


We start with the 1-dimensional subspace that best represents the data set $\{a_i : 1 \le i \le n\}$. We constrain the basis vector of such a subspace to be a unit vector and use the standard inner product to measure the norm of the projection onto this unit vector. If we let $v$ be the basis vector of such a subspace, the projection of $a_i$ onto this subspace is
\[ P_{\mathrm{span}\{v\}}(a_i) = \langle a_i, v \rangle v, \]
according to (4.4). The lengths of the data vectors after projection are then $\{|\langle a_i, v \rangle| : 1 \le i \le n\}$, and we can measure how well the overall data set is represented by
\[ \sum_{i=1}^{n} |P_{\mathrm{span}\{v\}}(a_i)|^2 = \sum_{i=1}^{n} \langle a_i, v \rangle^2 = \sum_{i=1}^{n} (a_i^\top v)^2. \]
Because $a_i$ is the $i$-th row of $A$, this quantity is equal to $|Av|^2$. We can thus find the best $v_1$ to approximate the $n \times d$ matrix $A$ by
\[ v_1 = \operatorname*{argmax}_{v \in \mathbb{R}^d : |v| = 1} |Av| = \operatorname*{argmax}_{v \in \mathbb{R}^d : |v| = 1} \sum_{i=1}^{n} (a_i^\top v)^2. \tag{5.3} \]
From Lemma C.3, we know that a solution exists to this problem, but the solution may not be unique: for instance, any unit vector would be a solution if $A$ were an identity matrix.
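For data in $\mathbb{R}^2$, problem (5.3) can be explored by brute force, since every unit vector has the form $(\cos t, \sin t)$. The sketch below scans angles on a grid; the grid search, the data points (chosen to lie near the line $y = x$), and the function names are illustrative assumptions and not an efficient or general method.

```python
import math

def objective(A, v):
    """Sum of squared projection lengths onto the unit vector v, i.e. |Av|^2."""
    return sum((row[0] * v[0] + row[1] * v[1]) ** 2 for row in A)

def best_fit_direction(A, steps=18000):
    """Grid search over unit vectors v = (cos t, sin t) with t in [0, pi)."""
    best_v, best_val = None, -1.0
    for s in range(steps):
        t = math.pi * s / steps
        v = (math.cos(t), math.sin(t))
        val = objective(A, v)
        if val > best_val:
            best_v, best_val = v, val
    return best_v, math.sqrt(best_val)

# Made-up data rows lying close to the line y = x.
A = [[2.0, 2.1], [-1.0, -0.9], [3.0, 2.9], [-2.0, -2.1]]
v1, sigma1 = best_fit_direction(A)
print(v1, sigma1)
```

The returned direction should be close to $(1, 1)/\sqrt{2}$, and the returned value approximates the maximal $|Av|$. Since $-v_1$ attains the same objective, scanning half a circle is enough.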


A Vector Orthogonal to a Set of Vectors 


When looking for an optimal subspace of dimension two or greater, we must solve the problem of finding a vector in a $(k + 1)$-dimensional subspace that is orthogonal to $k$ given orthonormal vectors, of course with $k < d$.


Lemma 5.1 If $\{v_1, \ldots, v_k\} \subset \mathbb{R}^d$ are orthonormal and $W$ is a $(k + 1)$-dimensional subspace, there exists at least one non-zero vector $w \in W$ that satisfies $\langle v_1, w \rangle = \cdots = \langle v_k, w \rangle = 0$. In other words, there exists a non-zero vector in $W$ that is orthogonal to all the vectors in $\{v_1, \ldots, v_k\}$.


Proof: Let $\{w_1, \ldots, w_{k+1}\}$ be a basis of $W$. Then, we can map a vector $w = x_1 w_1 + \cdots + x_{k+1} w_{k+1} \in W$ to its coefficient vector $x = (x_1, \ldots, x_{k+1}) \in \mathbb{R}^{k+1}$. Considering this correspondence, a non-zero solution to the following linear system, if it exists, gives an orthogonal vector $w$:
\[ \langle v_i, w \rangle = x_1 \langle v_i, w_1 \rangle + \cdots + x_{k+1} \langle v_i, w_{k+1} \rangle = 0, \quad i = 1, \ldots, k. \]
In other words, it is equivalent to finding a non-zero solution to $Bx = 0$, where
\[ B = \big( \langle v_i, w_j \rangle \big)_{1 \le i \le k, \; 1 \le j \le k+1}. \]
Because there is one more variable than the number of equations, there must be a non-zero solution according to Lemma 3.2. □


5.1.2 Best-fit 2-dimensional subspace 


We can find the best-fit 2-dimensional subspace for the dataset $\{a_1, \ldots, a_n\}$ by solving the following optimization problem:
\[ (w_1^*, w_2^*) = \operatorname*{argmax}_{\substack{w_1, w_2 \in \mathbb{R}^d : \\ |w_1| = |w_2| = 1, \\ \langle w_1, w_2 \rangle = 0}} \sum_{i=1}^{n} |P_{\mathrm{span}\{w_1, w_2\}}(a_i)|^2. \tag{5.4} \]
Solving this problem does not however shed light on the relationship between $w_1^*$ and $w_2^*$. We can instead use the best-fit 1-dimensional subspace $v_1$ in place of $w_1$ and try to solve this problem (5.4). That is, we greedily find $v_2$ given $v_1$ as follows:
1. $v_1 = \operatorname*{argmax}_{v \in \mathbb{R}^d : |v| = 1} |Av|$
2. $v_2 = \operatorname*{argmax}_{v \in \mathbb{R}^d : |v| = 1, \, \langle v, v_1 \rangle = 0} |Av|$
By Lemma C.4, we know the existence of a solution to 2 above. As the term ‘greedy’ implies, we repeatedly solve the one-dimensional optimization problem to find the optimal vector within a subspace orthogonal to the already identified subspace. In the case of two dimensions, this corresponds to finding the one-dimensional best-fit vector first and then subsequently finding the next one-dimensional best-fit vector. We now check whether $\{v_1, v_2\}$, from this procedure, is as expressive as $\{w_1^*, w_2^*\}$. First, we see that the original problem can be written down as
\[ (w_1^*, w_2^*) = \operatorname*{argmax}_{\substack{w_1, w_2 \in \mathbb{R}^d : \\ |w_1| = |w_2| = 1, \\ \langle w_1, w_2 \rangle = 0}} |Aw_1|^2 + |Aw_2|^2, \]
because
\begin{align*}
\sum_{i=1}^{n} |P_{\mathrm{span}\{w_1, w_2\}}(a_i)|^2 &= \sum_{i=1}^{n} |\langle a_i, w_1 \rangle w_1 + \langle a_i, w_2 \rangle w_2|^2 \\
&= \sum_{i=1}^{n} \big( \langle a_i, w_1 \rangle^2 + \langle a_i, w_2 \rangle^2 \big) && \text{since } \langle w_1, w_2 \rangle = 0 \\
&= |Aw_1|^2 + |Aw_2|^2
\end{align*}
due to the orthogonal projection (4.6). Let $W^* = \mathrm{span}\{w_1^*, w_2^*\}$ be the optimal subspace found by solving this problem. It is enough to show $|Aw_1^*|^2 + |Aw_2^*|^2 \le |Av_1|^2 + |Av_2|^2$.

Assume we have found $v_1$ by solving the first one-dimensional problem. According to Lemma 5.1, there exists a non-zero vector $w \in W^*$ such that $\langle v_1, w \rangle = 0$; we normalize it to a unit vector and refer to it as $\hat{w}_2$. Once we find a unit vector $\hat{w}_1$ in $W^*$ orthogonal to the given $\hat{w}_2$ using the Gram-Schmidt procedure from earlier, $\{\hat{w}_1, \hat{w}_2\}$ is an orthonormal basis of $W^*$. Because $\hat{w}_2$ satisfies $\langle v_1, \hat{w}_2 \rangle = 0$, it is a feasible solution to the second one-dimensional problem and it must be that $|A\hat{w}_2| \le |Av_2|$. Furthermore, because $\hat{w}_1$ satisfies the unit vector constraint of the first one-dimensional problem, it holds that $|A\hat{w}_1| \le |Av_1|$. Since both $|Aw_1^*|^2 + |Aw_2^*|^2$ and $|A\hat{w}_1|^2 + |A\hat{w}_2|^2$ are sums of squared norms of projections onto $W^*$, they coincide, that is, $|Aw_1^*|^2 + |Aw_2^*|^2 = |A\hat{w}_1|^2 + |A\hat{w}_2|^2$.

Therefore, the greedy approach results in the optimal 2-dimensional subspace.


5.1.3 Best-fit k-dimensional subspace 


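Similarly to the earlier cases, we find the optimal $k$-dimensional subspace, $W^* = \mathrm{span}\{w_1^*, \ldots, w_k^*\}$, for the data set $\{a_i : 1 \le i \le n\}$ by solving
\begin{align*}
\max_{w_1, \ldots, w_k \in \mathbb{R}^d} \quad & \sum_{i=1}^{n} |P_{\mathrm{span}\{w_1, \ldots, w_k\}}(a_i)|^2 \\
\text{s.t.} \quad & |w_j| = 1, \quad j = 1, \ldots, k, \\
& \langle w_i, w_j \rangle = 0 \quad \text{for } i \neq j.
\end{align*}
Let us now consider a greedy approach to finding this subspace.
A greedy procedure to find the best-fit $k$-dimensional subspace of an $n \times d$ matrix $A$
1. Set the first vector (by breaking ties arbitrarily) as
\[ v_1 = \operatorname*{argmax}_{v \in \mathbb{R}^d : |v| = 1} |Av|; \]
2. For $j = 2, \ldots, k$: $\{v_1, \ldots, v_{j-1}\}$ is already known and set
\[ v_j = \operatorname*{argmax}_{\substack{v \in \mathbb{R}^d : |v| = 1, \\ \langle v, v_1 \rangle = \cdots = \langle v, v_{j-1} \rangle = 0}} |Av|. \tag{5.5} \]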

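We use Lemma C.4 to show the existence of a solution to (5.5). Let $\{v_1, \ldots, v_k\}$ be a basis discovered by the greedy approach. Assume the greedy approach has successfully found the optimal $(k - 1)$-dimensional subspace so far. Thanks to Lemma 5.1, there exists a unit vector $\hat{w}_k$ in $W^*$ that is orthogonal to every vector in $\{v_1, \ldots, v_{k-1}\}$, which allows us to find an orthonormal basis of $W^*$, $\{\hat{w}_1, \ldots, \hat{w}_{k-1}, \hat{w}_k\}$, that includes $\hat{w}_k$. Because $\langle v_1, \hat{w}_k \rangle = 0, \ldots, \langle v_{k-1}, \hat{w}_k \rangle = 0$, $\hat{w}_k$ is a feasible solution to the $k$-th optimization problem and hence $|A\hat{w}_k| \le |Av_k|$ holds. Since $\mathrm{span}\{v_1, \ldots, v_{k-1}\}$ is an optimal $(k - 1)$-dimensional subspace, we also know
\begin{align*}
|A\hat{w}_1|^2 + \cdots + |A\hat{w}_{k-1}|^2 &= \sum_{i=1}^{n} |P_{\mathrm{span}\{\hat{w}_1, \ldots, \hat{w}_{k-1}\}}(a_i)|^2 \\
&\le \sum_{i=1}^{n} |P_{\mathrm{span}\{v_1, \ldots, v_{k-1}\}}(a_i)|^2 = |Av_1|^2 + \cdots + |Av_{k-1}|^2.
\end{align*}
By combining the two inequalities, we get
\begin{align*}
\sum_{i=1}^{n} |P_{\mathrm{span}\{\hat{w}_1, \ldots, \hat{w}_k\}}(a_i)|^2 &= |A\hat{w}_1|^2 + \cdots + |A\hat{w}_k|^2 \\
&\le |Av_1|^2 + \cdots + |Av_k|^2,
\end{align*}
which implies, since $W^* = \mathrm{span}\{\hat{w}_1, \ldots, \hat{w}_k\}$,
\[ \sum_{i=1}^{n} |P_{W^*}(a_i)|^2 \le |Av_1|^2 + \cdots + |Av_k|^2. \]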




Therefore we can use the greedy procedure above to find the best-fit $k$-dimensional subspace.
We call the unit vector $v_i$, obtained by solving the $i$-th one-dimensional problem, the $i$-th right singular vector and define the $i$-th singular value by
\[ \sigma_i = |Av_i|. \]
For right-singular vectors, we have the freedom of choosing the sign of each $v_i$ since $-v_i$ is also an optimal direction once $v_i$ is optimal for the $i$-th optimization problem. Based on the derivation of the greedy procedure above, we see that $\sigma_i \ge \sigma_{i+1}$ for all $i$ and that this procedure of iteratively finding singular vectors terminates when $\sigma_{r+1} = 0$. That is,
\[ \sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_r > 0 = \sigma_{r+1}. \]
We further define the $i$-th left singular vector, $u_i$, as
\[ u_i = \frac{1}{\sigma_i} A v_i, \]
which implies that
\[ A v_i = \sigma_i u_i. \]
In summary, we end up with the following theorem stating that the greedy procedure above finds the optimal subspace.


Theorem 5.1 Let $v_1, \ldots, v_r$ be the right singular vectors of an $n \times d$ matrix $A$. Then, $\mathrm{span}\{v_1, \ldots, v_k\}$, with $1 \le k \le r$, is the best-fit $k$-dimensional subspace of the rows of $A$ (in terms of the sum of squares of residuals).
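The greedy procedure can be sketched in code once a solver for each one-dimensional problem is chosen. The text does not prescribe one, so the sketch below swaps in power iteration on $A^\top A$ with projection onto the orthogonal complement of the vectors already found. This solver, the start vectors, the iteration count, and the example matrix are all assumptions for illustration, and the method needs $\sigma_j > 0$ and a start vector that is not orthogonal to the maximizer.

```python
import math

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def matvec(M, x):
    return [dot(row, x) for row in M]

def normalize(x):
    n = math.sqrt(dot(x, x))
    return [xi / n for xi in x]

def project_out(x, basis):
    """Remove from x its components along the orthonormal vectors in basis."""
    for u in basis:
        c = dot(x, u)
        x = [xi - c * ui for xi, ui in zip(x, u)]
    return x

def greedy_right_singular_vectors(A, k, iters=200):
    """Step j maximizes |Av| over unit v orthogonal to v_1, ..., v_{j-1}.
    Each step is solved approximately by power iteration on A^T A (an assumed solver)."""
    d = len(A[0])
    At = [list(col) for col in zip(*A)]
    vs, sigmas = [], []
    for j in range(k):
        v = normalize(project_out([1.0 + 0.1 * (i + j) for i in range(d)], vs))
        for _ in range(iters):
            v = normalize(project_out(matvec(At, matvec(A, v)), vs))
        vs.append(v)
        Av = matvec(A, v)
        sigmas.append(math.sqrt(dot(Av, Av)))      # sigma_j = |A v_j|
    return vs, sigmas

# Example built as 3 * e1 v1^T + 1 * e2 v2^T with v1 = (0.6, 0.8, 0), v2 = (-0.8, 0.6, 0).
A = [[1.8, 2.4, 0.0],
     [-0.8, 0.6, 0.0],
     [0.0, 0.0, 0.0]]
vs, sigmas = greedy_right_singular_vectors(A, k=2)
print(sigmas)
```

Because the sign of each right singular vector is free, the recovered vectors may differ from $v_1$ and $v_2$ by a factor of $-1$.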


If $\sigma_r > \sigma_{r+1} = 0$ for $r < d$, then $|Av_{r+1}| = 0$, meaning that every $a_i$ is fully contained in the subspace spanned by $\{v_1, \ldots, v_r\}$. In other words, from $a_i = \langle a_i, v_1 \rangle v_1 + \cdots + \langle a_i, v_r \rangle v_r$, $|a_i|^2 = \langle a_i, v_1 \rangle^2 + \cdots + \langle a_i, v_r \rangle^2$ holds for any $i$, and thereby
\[ \|A\|_F^2 = \sum_{i=1}^{n} |a_i|^2 = |Av_1|^2 + \cdots + |Av_r|^2 = \sigma_1^2 + \cdots + \sigma_r^2. \tag{5.6} \]
What does it mean to explain the data set $\{a_1, \ldots, a_n\}$ using the best-fit $k$-dimensional subspace with $k$ smaller than the minimal dimension $r$ of a subspace that can fully contain the data set? Although $|Av_{k+1}| \neq 0$ in this case, we can perform a variety of analyses on the data set by treating each data point as a linear combination of $k$ prototypes, as long as $|Av_{k+1}|$ is small enough.


5.2 Orthogonality of Left Singular Vectors 


From the constraints of greedy one-dimensional optimizations, we impose the orthogonality on the right-singular vectors. A natural question is whether the left-singular vectors are also orthogonal to each other. The answer is yes, even though it is not apparent since the left singular vectors are defined through a matrix multiplication. Assume the left-singular vectors $u_i$'s are not orthogonal. Let $i < j$ be two indices of left-singular vectors not orthogonal to each other, that is, $\langle u_i, u_j \rangle \neq 0$. By choosing the sign of $v_j$ appropriately, we can further assume that $\langle u_i, u_j \rangle = \delta > 0$, without loss of generality. For a small positive constant $\epsilon > 0$, define a unit vector by
\[ v_i' = \frac{v_i + \epsilon v_j}{|v_i + \epsilon v_j|} = \frac{1}{\sqrt{1 + \epsilon^2}} (v_i + \epsilon v_j). \]
By multiplying both sides with $A$, we get
\[ Av_i' = \frac{1}{\sqrt{1 + \epsilon^2}} (\sigma_i u_i + \epsilon \sigma_j u_j). \]
The squared norm is then
\begin{align*}
|Av_i'|^2 &= \frac{1}{1 + \epsilon^2} (\sigma_i^2 + 2\epsilon\delta\sigma_i\sigma_j + \epsilon^2\sigma_j^2) \\
&> (1 - \epsilon^2)(\sigma_i^2 + 2\epsilon\delta\sigma_i\sigma_j + \epsilon^2\sigma_j^2) \\
&> (1 - \epsilon^2)(\sigma_i^2 + 2\epsilon\delta\sigma_i\sigma_j) \\
&= \sigma_i^2 + \epsilon\big( -\epsilon\sigma_i^2 + 2(1 - \epsilon^2)\delta\sigma_i\sigma_j \big) \\
&> \sigma_i^2
\end{align*}
because $2(1 - \epsilon^2)\delta\sigma_i\sigma_j > \epsilon\sigma_i^2$ for a sufficiently small $\epsilon$. $v_i$ and $v_j$ are orthogonal to all $v_\ell$ with $\ell < i$, and thus $v_i'$ is feasible to the $i$-th optimization problem in the greedy procedure. The objective value attained with $v_i'$ is however greater than the optimal objective value $\sigma_i = |Av_i|$, which is a contradiction. Therefore, all $u_i$'s are orthogonal.

Lemma 5.2 The left singular vectors $u_i$'s of a matrix $A$ defined as
\[ u_i = \frac{1}{\sigma_i} A v_i \]
for each right-singular vector $v_i$, are orthogonal to each other.


5.3 Representing SVD in Various Forms 


Let $(\sigma_1, v_1, u_1)$ be a singular triplet satisfying $Av_1 = \sigma_1 u_1$. Can we come up with a simple matrix $A_1$ that represents the relationship encoded in this singular triplet? One way to measure the simplicity of a matrix is its rank. It turns out we can define a rank-one matrix $A_1$ from this singular triplet as
\[ A_1 = \sigma_1 u_1 v_1^\top, \]
which satisfies $A_1 v_1 = \sigma_1 u_1$. Let $(\sigma_2, v_2, u_2)$ be another singular triplet of the matrix $A$, where $v_1$ and $v_2$ are orthonormal, and similarly $u_1$ and $u_2$ are orthonormal. We build a corresponding rank-one matrix $\sigma_2 u_2 v_2^\top$. Let us add these two rank-one matrices to get
\[ A_2 = \sigma_1 u_1 v_1^\top + \sigma_2 u_2 v_2^\top. \]
From the orthonormality of right-singular vectors, we can easily see that $A_2$ satisfies both $A_2 v_1 = \sigma_1 u_1$ and $A_2 v_2 = \sigma_2 u_2$, and that $A_2$ is a rank-two matrix from Fact 5.4. From this observation, we can intuitively guess that we can obtain
\[ A = \sum_{i=1}^{\mathrm{rank}\, A} \sigma_i u_i v_i^\top \]
using as many singular triplets as $\mathrm{rank}\, A$.
Let $\sigma_1 \ge \cdots \ge \sigma_r > \sigma_{r+1} = 0$ for an $n \times d$ matrix $A$. For this matrix, we have two very useful representations: a sum of rank-one matrices, and a product of matrices with orthonormal columns and a diagonal matrix.


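• $A = \sum_{i=1}^{r} \sigma_i u_i v_i^\top$: Consider a basis $\{v_1, \ldots, v_r, v_{r+1}, \ldots, v_d\}$ of $\mathbb{R}^d$ that includes $\{v_1, \ldots, v_r\}$. We may assume that the basis is orthonormal thanks to the Gram-Schmidt procedure. We can then represent an arbitrary vector in $\mathbb{R}^d$ as $x = x_1 v_1 + \cdots + x_d v_d$. Because $\sigma_{r+1} = 0$, $Av_j = 0$ for $j = r+1, \ldots, d$. Furthermore, because $x_i = v_i^\top x$ for all $i$, we see that
\[ Ax = x_1 A v_1 + \cdots + x_r A v_r = x_1 \sigma_1 u_1 + \cdots + x_r \sigma_r u_r = \sum_{i=1}^{r} \sigma_i u_i (v_i^\top x) = \Big( \sum_{i=1}^{r} \sigma_i u_i v_i^\top \Big) x \]
for all $x \in \mathbb{R}^d$. This allows us to express $A$ as $A = \sum_{i=1}^{r} \sigma_i u_i v_i^\top$.

• $A = UDV^\top$ or $AV = UD$: Here, $U = [u_1 \mid \cdots \mid u_r]$ is $n \times r$, $V = [v_1 \mid \cdots \mid v_r]$ is $d \times r$ (so $V^\top$ is $r \times d$), $U^\top U = I_r = V^\top V$, and $D$ is an $r \times r$ positive diagonal matrix with diagonal entries $\sigma_1, \ldots, \sigma_r$. We see this by applying Corollary 3.1 to $U$, $D$, and $V$. According to (3.6), then, $UDV^\top = \sum_{i=1}^{r} \sigma_i u_i v_i^\top$. Therefore, $A = UDV^\top$.

For $A^\top$, $A^\top u_i = \big( \sum_{j=1}^{r} \sigma_j u_j v_j^\top \big)^\top u_i = \big( \sum_{j=1}^{r} \sigma_j v_j u_j^\top \big) u_i = \sigma_i v_i$. We derive and summarize the analogous properties of $A^\top$ below.

• $A^\top = VDU^\top$ with $U^\top U = I_r$, $V^\top V = I_r$, and an $r \times r$ positive diagonal matrix $D$

• $A^\top u_i = \sigma_i v_i$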




From the first one-dimensional optimization problem to compute the first singular value, we observe 
the following result. 


Fact 5.1 $\|A\|_2 = \sigma_1$ for any $n \times d$ matrix $A$.

Proof: Because the optimization problem for finding the first singular value in (5.3) and the optimization problem for computing the spectral norm in (4.14) are equivalent, we get $\sigma_1 = |Av_1| = \|A\|_2$. □


5.4 Properties of a Sum of Rank-one Matrices 


Consider a matrix defined as 


$$\sum_{i=1}^k \alpha_i u_i v_i^\top, \qquad (5.7)$$
where $\alpha_1 \ge \cdots \ge \alpha_k > 0$, and $\{u_1, \dots, u_k\} \subset \mathbb{R}^n$ and $\{v_1, \dots, v_k\} \subset \mathbb{R}^d$ are orthonormal vectors, respectively. This matrix may or may not have been constructed from SVD, and we can still derive the following results.

Fact 5.2 $(\alpha_i, v_i, u_i)$ is the $i$-th singular triplet of $\sum_{i=1}^k \alpha_i u_i v_i^\top$.


Proof: Let $A = \sum_{i=1}^k \alpha_i u_i v_i^\top$. We then extend $\{v_1, \dots, v_k\}$ orthogonally to construct an orthonormal basis of $\mathbb{R}^d$, $\{v_1, \dots, v_d\}$, via the Gram-Schmidt procedure. If we write a unit vector $v$ in $\mathbb{R}^d$ as $v = x_1 v_1 + \cdots + x_d v_d$, we can further write $Av$ as
$$Av = \Big( \sum_{i=1}^k \alpha_i u_i v_i^\top \Big) v = \sum_{i=1}^k \alpha_i u_i v_i^\top v = \sum_{i=1}^k \alpha_i x_i u_i, \qquad |Av|^2 = \sum_{i=1}^k \alpha_i^2 x_i^2.$$
Under the unit-norm constraint ($|v| = 1$), which translates to $x_1^2 + \cdots + x_d^2 = 1$, the choice $x_1 = 1$, $x_2 = x_3 = \cdots = x_d = 0$ maximizes $|Av|$. That is, $v = v_1$, and the first right singular vector of $A$ is $v_1$. Since $A v_1 = \alpha_1 u_1$ and $\alpha_1 = |A v_1|$, $\alpha_1$ is the first singular value, and $u_1$ is the first left singular vector. In order to get the second singular vector, we add the extra constraint that $\langle v, v_1 \rangle = 0$, which corresponds to considering only $v$ that can be expressed as $v = x_2 v_2 + \cdots + x_d v_d$. Then, $\alpha_2$, $v_2$ and $u_2$ are the second singular value, right singular vector, and left singular vector, respectively. We can recover all the remaining singular triplets similarly. ∎


When we have a singular triplet of $A$ in hand, does it help us find an SVD of $A^\top$? If we start by representing $A$ as the sum of rank-one matrices, we can constructively answer this question. From
$$A^\top = \Big( \sum_{i=1}^r \sigma_i u_i v_i^\top \Big)^\top = \sum_{i=1}^r \sigma_i (u_i v_i^\top)^\top = \sum_{i=1}^r \sigma_i v_i u_i^\top,$$
we see that the roles of the two vectors in each rank-one summand are swapped. Thanks to Fact 5.2, the right (left) singular vectors of $A$ are the left (right) singular vectors of $A^\top$. Moreover, $A$ and $A^\top$ share the same singular values.
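The swap of roles between $A$ and $A^\top$ can be checked numerically. The following is a minimal sketch assuming NumPy; the matrix size and the random seed are arbitrary choices made for illustration, not taken from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 6))

# Compact SVDs of A and of its transpose.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
s_transposed = np.linalg.svd(A.T, compute_uv=False)

# A and A^T share the same singular values.
same_values = np.allclose(s, s_transposed)

# Swapping the two vectors in each summand rebuilds A^T = sum_i sigma_i v_i u_i^T.
At_rebuilt = sum(s[i] * np.outer(Vt[i], U[:, i]) for i in range(len(s)))
swap_ok = np.allclose(At_rebuilt, A.T)
```

Here row `Vt[i]` plays the role of $v_i$ and column `U[:, i]` the role of $u_i$.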


98 Chapter 5. Singular Value Decomposition (SVD) 


For the sum of rank-one matrices, we can easily read the matrix norms from the coefficients of the rank-one terms, which are the singular values.

Fact 5.3 $\big\| \sum_{i=1}^k \alpha_i u_i v_i^\top \big\|_2 = \alpha_1$ and $\big\| \sum_{i=1}^k \alpha_i u_i v_i^\top \big\|_F = \sqrt{\alpha_1^2 + \cdots + \alpha_k^2}$.

Proof: They are derived from Fact 5.1, Fact 5.2, and (5.6). ∎


Another important question is on the rank of a sum of rank-one matrices. 
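Fact 5.4 $\operatorname{rank}\big( \sum_{i=1}^k \alpha_i u_i v_i^\top \big) = k$.

Proof: Let us characterize the null space of $A = \sum_{i=1}^k \alpha_i u_i v_i^\top$. We do so by looking for a vector $v \in \mathbb{R}^d$ that satisfies $Av = 0$. Start from the orthonormal vectors $v_1, \dots, v_k$, and extend them to construct an orthonormal basis $\{v_1, \dots, v_k, v_{k+1}, \dots, v_d\}$ of $\mathbb{R}^d$. When we write $v \in \mathbb{R}^d$ as $v = \sum_{i=1}^d x_i v_i$ with $(x_1, \dots, x_d)^\top \in \mathbb{R}^d$, the necessary and sufficient condition for $Av = 0$ is $x_1 = \cdots = x_k = 0$, since
$$Av = \sum_{i=1}^d x_i A v_i = \sum_{i=1}^d x_i \Big( \sum_{j=1}^k \alpha_j u_j v_j^\top \Big) v_i = \sum_{j=1}^k \sum_{i=1}^d x_i \alpha_j u_j v_j^\top v_i = \sum_{i=1}^k x_i \alpha_i u_i,$$
and $u_1, \dots, u_k$ are linearly independent with all $\alpha_i > 0$. In other words, $\operatorname{Null}(A) = \operatorname{span}\{v_{k+1}, \dots, v_d\}$, and thereby, $\dim \operatorname{Null}(A) = d - k$. Then, according to the fundamental theorem of linear algebra (3.4),
$$\operatorname{rank} A = d - \dim \operatorname{Null}(A) = k.$$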
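Facts 5.2 through 5.4 can be sanity-checked on a small synthetic matrix. The sketch below assumes NumPy; the sizes, the coefficients in `alpha`, and the use of a reduced QR factorization to produce orthonormal vectors are illustrative choices rather than part of the text.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, k = 6, 5, 3
alpha = np.array([3.0, 2.0, 0.5])            # alpha_1 >= ... >= alpha_k > 0

# Orthonormal u_1..u_k in R^n and v_1..v_k in R^d from reduced QR factorizations.
U, _ = np.linalg.qr(rng.standard_normal((n, k)))
V, _ = np.linalg.qr(rng.standard_normal((d, k)))

# The sum of rank-one matrices in (5.7).
A = sum(alpha[i] * np.outer(U[:, i], V[:, i]) for i in range(k))

sing = np.linalg.svd(A, compute_uv=False)    # singular values, descending
spectral = np.linalg.norm(A, 2)              # Fact 5.3: equals alpha_1
frobenius = np.linalg.norm(A, "fro")         # Fact 5.3: sqrt of sum of squares
rank = np.linalg.matrix_rank(A, tol=1e-10)   # Fact 5.4: equals k
```

The leading `k` entries of `sing` should reproduce `alpha`, and the remaining ones should vanish up to rounding.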


To see the usefulness of these facts, let us consider the following matrix given as a sum of rank-one 


matrices. 


Example 5.1 Let a 4 × 5 matrix $A$ be given as


0 1 1 
a= i 10 o of -|i 101 of -o o 2 oa: 
0 0 0 


After normalization of vectors, we get 


0 A A 
v2 v2 
0 ae al 
z Fi _ Jz) [= 4 Í v2 2 as 
A=v2! | (J, Jp 9 0 0 vaP [A ye of 0 voj [o o % o i): 
0 0 0 
We rearrange the terms, modifying the leading signs, and get
-1 =i 
Z Z : 
wl -1 0 
— v2 2 A Z|] a a a k 
a= Vi0}¥?! o % 0 Jl +v a Oo a o a ooo 
0 0 0 


$$= \sigma_1 u_1 v_1^\top + \sigma_2 u_2 v_2^\top + \sigma_3 u_3 v_3^\top.$$

Observe that $\{u_1, u_2, u_3\}$ and $\{v_1, v_2, v_3\}$ are orthonormal. Fact 5.2 tells us that $(\sigma_i, v_i, u_i)$ is the $i$-th singular triplet of the matrix $A$, and thereby, $\|A\|_2 = \sqrt{10}$ and $\|A\|_F = 3\sqrt{2}$ according to Fact 5.3. We can also read off the rank of $A$, $\operatorname{rank} A = 3$, using Fact 5.4. ∎


With this result on the rank of the sum of rank-one matrices, we now state the singular value decomposition.

Singular Value Decomposition (SVD) Any $n \times d$ matrix $A$ (with $r = \operatorname{rank} A$) can be represented as
$$A = UDV^\top = \sum_{i=1}^r \sigma_i u_i v_i^\top, \qquad (5.8)$$
where the $n \times r$ matrix $U$ and the $d \times r$ matrix $V$ have orthonormal columns, and $D = \operatorname{diag}(\sigma_1, \dots, \sigma_r)$ is a diagonal matrix with diagonal entries $\sigma_1 \ge \cdots \ge \sigma_r > 0$. $u_i$ and $v_i$ are the $i$-th column vectors of $U$ and $V$, respectively. Note that $\{u_1, \dots, u_r\} \subset \mathbb{R}^n$ and $\{v_1, \dots, v_r\} \subset \mathbb{R}^d$ are orthonormal sets of vectors, respectively.
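To make (5.8) concrete, the following sketch extracts a compact SVD with NumPy and rebuilds the matrix both as a product and as a sum of rank-one terms. It assumes NumPy's `numpy.linalg.svd`; the test matrix and the threshold `1e-10` used to decide which singular values count as non-zero are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
# A 6 x 5 matrix of rank 3, built as a product of thin factors.
A = rng.standard_normal((6, 3)) @ rng.standard_normal((3, 5))

U_all, s_all, Vt_all = np.linalg.svd(A, full_matrices=False)

# Keep only the r triplets whose singular values are numerically non-zero.
r = int(np.sum(s_all > 1e-10 * s_all[0]))
U, D, V = U_all[:, :r], np.diag(s_all[:r]), Vt_all[:r].T

rebuilt_product = U @ D @ V.T
rebuilt_sum = sum(D[i, i] * np.outer(U[:, i], V[:, i]) for i in range(r))
```

Both reconstructions use only `r` columns of `U` and `V`, matching the $n \times r$ and $d \times r$ shapes in the statement.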


Based on the definition of SVD, we can relate the spectral and Frobenius norms with each other.

Fact 5.5 For any matrix $A$, $\|A\|_2 \le \|A\|_F$. If $\operatorname{rank}(A) = r$, $\|A\|_F \le \sqrt{r}\,\|A\|_2$. For rank-one matrices, the two norms coincide.

Proof: Although we have already proven in Fact 4.17 that $\|A\|_2 \le \|A\|_F$, we consider another proof based on SVD here. Let us express an $n \times d$ matrix $A$ as $A = \sum_{i=1}^r \sigma_i u_i v_i^\top$ using SVD. Then, due to Fact 5.3,
$$\|A\|_2 = \sigma_1 \le \sqrt{\sigma_1^2 + \cdots + \sigma_r^2} = \|A\|_F.$$
Furthermore,
$$\|A\|_F = \sqrt{\sigma_1^2 + \cdots + \sigma_r^2} \le \sqrt{r \sigma_1^2} = \sqrt{r}\,\sigma_1 = \sqrt{r}\,\|A\|_2.$$
For the case of $\operatorname{rank} A = 1$, $\|A\|_2 \le \|A\|_F \le \sqrt{1}\,\|A\|_2$ implies $\|A\|_2 = \|A\|_F$. ∎


Once we obtain an SVD of an invertible matrix, we can write the inverse of the matrix using the singular triplets.

Fact 5.6 Let $A$ be an $n \times n$ invertible matrix. Assume that an SVD of $A$ is given as
$$U \Sigma V^\top = \sum_{i=1}^n \sigma_i u_i v_i^\top.$$
Then, all $\sigma_i > 0$, and
$$A^{-1} = V \Sigma^{-1} U^\top = \sum_{j=1}^n \sigma_j^{-1} v_j u_j^\top. \qquad (5.9)$$

Proof: The invertibility of $A$ implies $\operatorname{rank} A = n$, and due to Fact 5.4, all $\sigma_i$'s are positive. Then, $\Sigma^{-1}$ and $\sum_{j=1}^n \sigma_j^{-1} v_j u_j^\top$ are well-defined. Furthermore, $U$ and $V$ are orthogonal matrices. So,
$$A^{-1} = (U \Sigma V^\top)^{-1} = V \Sigma^{-1} U^\top = \sum_{j=1}^n \sigma_j^{-1} v_j u_j^\top.$$
j=1 


We will introduce the notion of pseudoinverse for non-square and/or non-invertible matrices in Section 5.8. The pseudoinverse can be expressed similarly to (5.9).
With SVD, we can readily compute $\|A^{-1}\|_2 \|A\|_2$, which is called the condition number of a matrix $A$ and denoted by $\kappa(A)$. This is a very important concept in numerical linear algebra.

Example 5.2 Let $A$ be an $n \times n$ invertible matrix with an SVD given as $\sum_{i=1}^n \sigma_i u_i v_i^\top$, where $\sigma_1 \ge \cdots \ge \sigma_n > 0$. Let us find $\|A^{-1}\|_2 \|A\|_2$. Facts 5.3 and 5.6 imply $\|A\|_2 = \sigma_1$ and $\|A^{-1}\|_2 = \sigma_n^{-1}$. Therefore, $\|A^{-1}\|_2 \|A\|_2 = \sigma_1 \sigma_n^{-1}$. ∎
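The inverse formula (5.9) and the condition number $\sigma_1 \sigma_n^{-1}$ can be illustrated on a small matrix. This is a sketch assuming NumPy; the 3 × 3 matrix below is an arbitrary strictly diagonally dominant (hence invertible) example, not one from the text.

```python
import numpy as np

# Strictly diagonally dominant, hence invertible.
A = np.array([[4.0, 1.0, 0.0],
              [2.0, 5.0, 1.0],
              [0.0, 1.0, 3.0]])

U, s, Vt = np.linalg.svd(A)

# (5.9): A^{-1} = V Sigma^{-1} U^T.
A_inv = Vt.T @ np.diag(1.0 / s) @ U.T

# Example 5.2: condition number as the ratio of extreme singular values.
kappa = s[0] / s[-1]
```

For comparison, `np.linalg.inv(A)` and `np.linalg.cond(A, 2)` should agree with `A_inv` and `kappa` up to rounding.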


5.5 Spectral Decomposition of a Symmetric Matrix via SVD 


Consider a symmetric, rank-$r$, $n \times n$ matrix $A$ with singular values $\sigma_1 \ge \cdots \ge \sigma_r > 0 = \sigma_{r+1}$. Because it is symmetric, both right and left singular vectors are vectors in $\mathbb{R}^n$. Let $(\sigma_1, v_1, u_1), \dots, (\sigma_{j-1}, v_{j-1}, u_{j-1})$ be the first $(j-1)$ singular triplets. We further assume that either $u_i = v_i$ or $u_i = -v_i$ holds for all of these known first $(j-1)$ singular triplets. If we denote by $(\sigma_j, v, u)$ the $j$-th singular triplet obtained by solving the $j$-th optimization problem in the greedy SVD procedure (5.5), then $v \perp \{v_1, \dots, v_{j-1}\}$ according to (5.5), and $u \perp \{u_1, \dots, u_{j-1}\}$ according to Lemma 5.2. Because of the assumption that $u_i = \pm v_i$ for $i \le j-1$,
$$(v + u) \perp \{v_1, \dots, v_{j-1}\}. \qquad (5.10)$$
As $A = A^\top$, both $Av = \sigma_j u$ and $Au = \sigma_j v$ hold, and one of the following two cases holds as well:

• $v + u \ne 0$: Let $v_j = \frac{1}{|v+u|}(v + u)$, and we get $A v_j = \sigma_j v_j$. Because $v_j$ satisfies $|A v_j| = \sigma_j$ and is also feasible to (5.5), $v_j$ is an optimal solution to (5.5). Hence, $v_j$ is eligible for the $j$-th right singular vector and also the $j$-th left singular vector;

• $v + u = 0$: Let $v_j = v$, and then $v_j$ is the $j$-th right singular vector. Since $v_j = -u$, $A v_j = \sigma_j u = -\sigma_j v_j$, which means that $u_j = -v_j$ is the $j$-th left singular vector.

That is, either way, we get the $j$-th singular triplet $(\sigma_j, v_j, u_j)$ with $u_j = \pm v_j$. By repeating this procedure $r$ times, we obtain $r$ singular triplets where each pair of right and left singular vectors are parallel to each other.


If we set $\lambda_i$ as $\sigma_i$ when $v_i = u_i$ or $-\sigma_i$ when $v_i = -u_i$, then $A v_i = \lambda_i v_i$ holds for all $i$, resulting in $r$ scalar-vector pairs, $(\lambda_1, v_1), \dots, (\lambda_r, v_r)$. This simple relationship for the pairs is named in the following definition.


Definition 5.1 For a square matrix $A$, a scalar $\lambda$ and a vector $v$ are an eigenvalue and an eigenvector of $A$, respectively, if they satisfy
$$Av = \lambda v. \qquad (5.11)$$
This vector equation is called an eigen-decomposition and the pair $(\lambda, v)$ is called an eigenpair.^a

^a We study eigenvalues and eigenvectors in much more depth in Chapter 9. Until then, all we need is the definition of eigenvalues and eigenvectors satisfying (5.11).


After obtaining the eigenpairs $(\lambda_1, v_1), \dots, (\lambda_r, v_r)$ of $A$ for the case where left and right singular vectors are parallel, we can express the matrix $A$ as the sum of rank-one matrices, as follows:
$$A = \sum_{i=1}^r \sigma_i u_i v_i^\top = \sum_{i=1}^r \lambda_i v_i v_i^\top.$$
Starting from these $r$ eigenvectors, we can get $n$ orthonormal vectors $\{v_1, \dots, v_n\}$ using the Gram-Schmidt procedure. By letting $\lambda_{r+1} = \cdots = \lambda_n = 0$, we end up with $n$ eigenpairs, $(\lambda_1, v_1), \dots, (\lambda_n, v_n)$. These eigenvalues are all real, as they are either singular values themselves or their negations. We call such a decomposition of a matrix a spectral decomposition.


Theorem 5.2 (Real Spectral Decomposition) Let $A$ be a real symmetric matrix of rank $r$. Then, $A$ can be represented as
$$A = V \Lambda V^\top = \sum_{i=1}^n \lambda_i v_i v_i^\top, \qquad (5.12)$$
where $V$ is an orthogonal matrix with orthonormal columns $v_1, v_2, \dots, v_n$, $|v_i| = 1$, and $\Lambda = \operatorname{diag}(\lambda_1, \dots, \lambda_n)$. $\lambda_i$'s and $v_i$'s are the eigenvalues and eigenvectors of $A$, respectively. There are exactly $r$ non-zero eigenvalues.
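Theorem 5.2 can be observed numerically with a symmetric eigensolver. The sketch below assumes NumPy's `numpy.linalg.eigh`; the construction of a rank-3 symmetric test matrix and the threshold `1e-10` for "non-zero" are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
M = rng.standard_normal((5, 3))
A = M @ np.diag([2.0, -1.0, 0.5]) @ M.T      # symmetric 5 x 5 matrix of rank 3

# eigh returns real eigenvalues and orthonormal eigenvectors for symmetric input.
lam, V = np.linalg.eigh(A)

# (5.12): A = sum_i lambda_i v_i v_i^T.
rebuilt = sum(lam[i] * np.outer(V[:, i], V[:, i]) for i in range(5))
num_nonzero = int(np.sum(np.abs(lam) > 1e-10))
```

The count `num_nonzero` matches the rank, and the negative entry on the diagonal shows that eigenvalues, unlike singular values, may be negative.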


Real spectral decomposition is the most popular form of so-called eigendecomposition, which we will study in more depth in Chapter 9. In Appendix F, we provide another proof of the real spectral decomposition that does not rely on SVD, and we highly recommend reading that alternative proof alongside this one. We demonstrate the consequences of this theorem using the following symmetric matrix given as the sum of rank-one matrices.




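Example 5.3 Let a 4 × 4 matrix $A$ be given as
$$A = \begin{bmatrix} 0 \\ 0 \\ 1 \\ 0 \end{bmatrix} \begin{bmatrix} 0 & 0 & 2 & 0 \end{bmatrix} - 3 \begin{bmatrix} 1 \\ 1 \\ 0 \\ 0 \end{bmatrix} \begin{bmatrix} 1 & 1 & 0 & 0 \end{bmatrix} + 2 \begin{bmatrix} 1 \\ -1 \\ 0 \\ 0 \end{bmatrix} \begin{bmatrix} 1 & -1 & 0 & 0 \end{bmatrix}.$$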


1. Observe that $[0\ 0\ 1\ 0]^\top$, $[1\ 1\ 0\ 0]^\top$, $[1\ {-1}\ 0\ 0]^\top$ are mutually orthogonal. Let $v_1 = [0\ 0\ 1\ 0]^\top$, $v_2 = \frac{1}{\sqrt{2}}[1\ 1\ 0\ 0]^\top$, $v_3 = \frac{1}{\sqrt{2}}[1\ {-1}\ 0\ 0]^\top$. Then $v_i$'s are orthonormal and
$$A = 2 v_1 v_1^\top - 6 v_2 v_2^\top + 4 v_3 v_3^\top.$$
If we multiply $A$ with $v_1$ from the right, we obtain $A v_1 = 2 v_1$, because $v_i$'s are orthonormal. Repeating it for all $v_i$'s, we arrive at the eigenpairs $(2, v_1)$, $(-6, v_2)$ and $(4, v_3)$. Moreover, if we multiply $A$ with a vector perpendicular to $v_1$, $v_2$ and $v_3$, such as $v_4 = [0\ 0\ 0\ 1]^\top$, we get $A v_4 = 0$. In other words, $(0, v_4)$ is also an eigenpair.
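2. We can slightly modify the decomposition in 1 by letting $u_1 = v_1$, $u_2 = -v_2$ and $u_3 = v_3$. Then
$$A = 2 u_1 v_1^\top + 6 u_2 v_2^\top + 4 u_3 v_3^\top.$$
We effortlessly get a (compact) SVD from this decomposition:
$$A = UDV^\top, \quad U = [u_2 \,|\, u_3 \,|\, u_1], \quad V = [v_2 \,|\, v_3 \,|\, v_1], \quad D = \operatorname{diag}(6, 4, 2).$$
The singular triplets, written as (singular value, right singular vector, left singular vector), are then
$$(6, v_2, u_2), \quad (4, v_3, u_3), \quad (2, v_1, u_1).$$
$u_i$'s are left singular vectors, and $v_i$'s are right singular vectors. ∎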
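The numbers in Example 5.3 can be reproduced directly from the decomposition $A = 2 v_1 v_1^\top - 6 v_2 v_2^\top + 4 v_3 v_3^\top$. This is a small check assuming NumPy; the variable names are ours.

```python
import numpy as np

v1 = np.array([0.0, 0.0, 1.0, 0.0])
v2 = np.array([1.0, 1.0, 0.0, 0.0]) / np.sqrt(2)
v3 = np.array([1.0, -1.0, 0.0, 0.0]) / np.sqrt(2)

A = 2 * np.outer(v1, v1) - 6 * np.outer(v2, v2) + 4 * np.outer(v3, v3)

eigenvalues = np.linalg.eigvalsh(A)                    # ascending order
singular_values = np.linalg.svd(A, compute_uv=False)   # descending order
```

The eigenvalues keep their signs, while the singular values are their absolute values sorted in decreasing order.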
Let $P_{A,\lambda_i}$ be a projection transformation onto $\operatorname{span}\{v_1, \dots, v_k\}$, where $\{v_1, \dots, v_k\}$ are $k$ eigenvectors of $A$ that have the same corresponding eigenvalue $\lambda_i$ from Theorem 5.2. That is,^1
$$P_{A,\lambda_i} = \sum_{j=1}^k v_j v_j^\top. \qquad (5.13)$$
If $\lambda_i \ne \lambda_j$, the vectors constituting $P_{A,\lambda_i}$ and $P_{A,\lambda_j}$ are orthogonal and $P_{A,\lambda_i} P_{A,\lambda_j} = 0$ holds. In addition, because $P_{A,\lambda_i}$ is a projection, the following hold:
$$P_{A,\lambda_i}^2 = P_{A,\lambda_i} \quad \text{and} \quad P_{A,\lambda_i}^\top = P_{A,\lambda_i}.$$

^1 We treat the linear operator and its corresponding matrix interchangeably here.


When there are $r$ distinct eigenvalues $\lambda_1, \dots, \lambda_r$, we can also write the real spectral decomposition (5.12) succinctly as
$$A = \sum_{i=1}^r \lambda_i P_{A,\lambda_i}. \qquad (5.14)$$
Consider an example of such a decomposition.


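With $e_i$ denoting the $i$-th standard basis vector of $\mathbb{R}^4$,
$$
\begin{aligned}
A &= \begin{bmatrix} 2 & 0 & 0 & 0 \\ 0 & 2 & 0 & 0 \\ 0 & 0 & 3 & 0 \\ 0 & 0 & 0 & 3 \end{bmatrix} \\
&= 2 \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{bmatrix} + 2 \begin{bmatrix} 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{bmatrix} + 3 \begin{bmatrix} 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 \end{bmatrix} + 3 \begin{bmatrix} 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} \\
&= 2 e_1 e_1^\top + 2 e_2 e_2^\top + 3 e_3 e_3^\top + 3 e_4 e_4^\top \\
&= 2 \big( e_1 e_1^\top + e_2 e_2^\top \big) + 3 \big( e_3 e_3^\top + e_4 e_4^\top \big) \\
&= 2 \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{bmatrix} + 3 \begin{bmatrix} 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} \\
&= 2 P_{A,2} + 3 P_{A,3}.
\end{aligned}
$$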


Even if we choose to use different orthonormal vectors in the third step above, we end up with the same 


projection-based expression: 


oO 
So 


+3 


(ae ee en = 
Oe o D 
Oo 

o 2 


104 Chapter 5. Singular Value Decomposition (SVD) 


3 4 0 0 3 -4 0 0 000 0 000 0 
3 4 0 0 -4 $ 00 000 0 000 0 
= 2 +2 +3 +3 
000 0 0 0 00 001 0 000 0 
000 0 0 0 00 000 0 000 1 
100 0 000 0 
0 10 0 000 0 
= 2 +3 
000 0 00 1 0 
000 0 000 1 
= 2P42+3Pa43 


5.6 Relationship between Singular Values and Eigenvalues 


As is clear from Definition 5.1, an eigenpair is defined only for square matrices. Even for real square matrices, eigenvalues and eigenvectors are not guaranteed to be real if the matrices are not symmetric, and there might be fewer linearly independent eigenvectors than the number of columns. The real spectral decomposition, however, guarantees the existence of real eigenvalues and enough orthonormal eigenvectors for symmetric matrices, whereas SVD always results in as many singular triplets as the rank of a matrix of any size. As a further difference, singular values are always non-negative, but eigenvalues of the corresponding matrix may be negative. In short, SVD and eigendecomposition are not equivalent.

For a symmetric matrix, however, we can match each singular triplet with an eigenpair of the symmetric matrix by modifying the greedy procedure (5.5) for SVD. We can also obtain an SVD from the eigenvalues and orthonormal eigenvectors of a symmetric matrix. Assume we know all eigenpairs $(\lambda_i, v_i)$, $i = 1, \dots, n$, of an $n \times n$ matrix $A$ with orthonormal eigenvectors $\{v_1, \dots, v_n\}$. Because the eigenvectors are orthonormal, we can write $A$ as $A = \sum_{i=1}^n \lambda_i v_i v_i^\top$. Let $\operatorname{rank} A = r$ or, equivalently, let there be $r$ non-zero eigenvalues, i.e., $|\lambda_1| \ge |\lambda_2| \ge \cdots \ge |\lambda_r| > 0 = \lambda_{r+1} = \cdots = \lambda_n$. Then, by letting $\sigma_i = |\lambda_i|$ and $u_i = \operatorname{sign}(\lambda_i) v_i$ for $i = 1, \dots, r$, we get an SVD of $A$, as $A = \sum_{i=1}^r \sigma_i u_i v_i^\top$. Let us look at the matrix norm of a symmetric matrix in terms of its eigenvalues.


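Fact 5.7 If a symmetric matrix $A$ has eigenvalues $\lambda_1, \dots, \lambda_n$, then
$$\|A\|_2 = \max_{1 \le i \le n} |\lambda_i| = \max_{|x| = 1} |x^\top A x| \quad \text{and} \quad \|A\|_F = \sqrt{\lambda_1^2 + \cdots + \lambda_n^2}.$$
Proof: Let $A = \sum_{i=1}^n \lambda_i v_i v_i^\top$, and assume $|\lambda_i| \ge |\lambda_{i+1}|$ for convenience. As above, set $\sigma_i = |\lambda_i|$. We also set $u_i = -v_i$ if $\lambda_i < 0$ and $u_i = v_i$ otherwise. Then, we have an SVD representation $A = \sum_{i=1}^r \sigma_i u_i v_i^\top$, and the first and the third equalities hold by Fact 5.3. Note that $\lambda_{r+1} = \cdots = \lambda_n = 0$ if $\operatorname{rank} A = r < n$. For $|x| = 1$, $|x^\top A x| \le |x| |Ax| = |Ax| \le \|A\|_2$, and hence $\max_{|x|=1} |x^\top A x| \le \|A\|_2 = |\lambda_1|$. Because $\max_{|x|=1} |x^\top A x| \ge |v_1^\top A v_1| = |\lambda_1|$, the second equality holds. ∎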
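The construction $\sigma_i = |\lambda_i|$, $u_i = \operatorname{sign}(\lambda_i) v_i$ and the norm identities of Fact 5.7 can be tried on a small symmetric matrix. The sketch assumes NumPy; the 3 × 3 matrix is an arbitrary invertible symmetric example with one negative eigenvalue, chosen for illustration only.

```python
import numpy as np

A = np.array([[1.0, 2.0, 0.0],
              [2.0, -3.0, 1.0],
              [0.0, 1.0, 0.5]])              # symmetric, determinant -4.5

lam, V = np.linalg.eigh(A)
order = np.argsort(-np.abs(lam))             # sort by decreasing |lambda_i|
lam, V = lam[order], V[:, order]

sigma = np.abs(lam)                          # sigma_i = |lambda_i|
U = V * np.sign(lam)                         # u_i = sign(lambda_i) v_i, column-wise
svd_rebuilt = U @ np.diag(sigma) @ V.T

rayleigh_at_v1 = abs(V[:, 0] @ A @ V[:, 0])  # attains max |x^T A x| over unit x
```

Because all eigenvalues here are non-zero, `np.sign(lam)` never returns zero and every column of `U` is a unit vector.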




Singular triplets of $A$ and eigenpairs of $A^\top A$


It is common to consider eigenpairs of $A^\top A$ or $A A^\top$ and their correspondences to the singular triplets of $A$, when $A$ is neither square nor symmetric. Let $A$ be an $n \times d$ matrix of rank $r$. Assume that $(\sigma, v, u)$ is a singular triplet of $A$ such that $Av = \sigma u$ and $A^\top u = \sigma v$. Then, $A^\top A v = \sigma A^\top u = \sigma^2 v$ and $A A^\top u = \sigma A v = \sigma^2 u$, which implies that $(\sigma^2, v)$ and $(\sigma^2, u)$ are eigenpairs of $A^\top A$ and $A A^\top$, respectively.

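Consider instead an eigenpair $(\lambda, v)$ of $A^\top A$, where $\lambda \ne 0$ and $|v| = 1$. Multiplying both sides of $A^\top A v = \lambda v$ by $v^\top$ gives $|Av|^2 = \lambda |v|^2 = \lambda$, so $\lambda > 0$ if $\lambda \ne 0$. We also see that $|u| = 1$ and $A^\top u = \sqrt{\lambda}\, v$, if we let $u = \frac{1}{\sqrt{\lambda}} A v$. That is, $(\sqrt{\lambda}, v, u)$ is a singular triplet of $A$. Since $A^\top A$ is symmetric, Theorem 5.2 allows us to assume that all eigenvectors of $A^\top A$ are orthonormal. Let $(\hat{\lambda}, \hat{v})$ be another eigenpair of $A^\top A$ with $\hat{\lambda} \ne 0$ such that $\hat{v}^\top v = 0$. By letting $\hat{u} = \frac{1}{\sqrt{\hat{\lambda}}} A \hat{v}$ similarly to $u$, $u$ and $\hat{u}$ are also orthonormal since
$$\hat{u}^\top u = \frac{1}{\sqrt{\hat{\lambda} \lambda}}\, \hat{v}^\top A^\top A v = \frac{\lambda}{\sqrt{\hat{\lambda} \lambda}}\, \hat{v}^\top v = 0.$$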

Therefore, we get $r$ triplets, $(\sqrt{\lambda_i}, v_i, u_i)$, with orthonormal $v_i$ and $u_i$ and positive $\lambda_i$. The sum of rank-one matrices induced by these triplets, $\sum_{i=1}^r \sqrt{\lambda_i}\, u_i v_i^\top$, results in the same vector as the matrix $A$ when multiplied from the right by an arbitrary vector. Thus, they are equivalent, i.e.,
$$A = \sum_{i=1}^r \sqrt{\lambda_i}\, u_i v_i^\top,$$
and these $r$ triplets are the singular triplets of $A$ due to Fact 5.2. Many commercial software packages compute singular values by computing eigenvalues of $A A^\top$ or $A^\top A$. However, such a decomposition is often not unique, since the correspondence between singular values and eigenvalues is not unique.
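The route from eigenpairs of $A^\top A$ to singular triplets of $A$ can be sketched as follows. This assumes NumPy and a random full-column-rank matrix, so that all eigenvalues of $A^\top A$ are positive; it illustrates the argument above and is not a description of how any particular package computes an SVD.

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.standard_normal((6, 3))              # full column rank with probability one

# Eigenpairs of the symmetric matrix A^T A, reordered to descending eigenvalues.
lam, V = np.linalg.eigh(A.T @ A)
lam, V = lam[::-1], V[:, ::-1]

sigma = np.sqrt(lam)                         # singular values sqrt(lambda_i)
U = (A @ V) / sigma                          # u_i = A v_i / sqrt(lambda_i), column-wise
rebuilt = U @ np.diag(sigma) @ V.T
```

The columns of `U` come out orthonormal without any explicit orthogonalization, which mirrors the computation of $\hat{u}^\top u$ above.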


Lemma 5.3 Assume that $A$ is a real matrix of arbitrary size. Then the square of any singular value and the corresponding right singular vector of $A$ form an eigenpair of $A^\top A$, and the left singular vector is an eigenvector of $A A^\top$. Conversely, if $A^\top A$ admits eigenvalues and orthonormal eigenvectors, then the square roots of the eigenvalues are singular values of $A$, and the eigenvectors form right singular vectors.


Symmetrization 
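We define the symmetrization $s(A)$ of an $m \times n$ matrix $A$ as
$$s(A) = \begin{bmatrix} 0 & A \\ A^\top & 0 \end{bmatrix}. \qquad (5.15)$$
As the name suggests, the resulting $(m+n) \times (m+n)$ matrix is symmetric. Symmetrization is linear, since $s(A + \alpha B) = s(A) + \alpha s(B)$. Consider an $\mathbb{R}^m$-vector $u$ and an $\mathbb{R}^n$-vector $v$, as well as the following vectors of size $(m+n)$ to work with symmetrization: $w = \begin{bmatrix} u \\ v \end{bmatrix}$ and $\bar{w} = \begin{bmatrix} -u \\ v \end{bmatrix}$. Then,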



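• If $(\sigma, v, u)$ is a singular triplet of $A$ with $\sigma > 0$, $|u| = 1$, and $|v| = 1$,
$$s(A) w = \begin{bmatrix} Av \\ A^\top u \end{bmatrix} = \begin{bmatrix} \sigma u \\ \sigma v \end{bmatrix} = \sigma w.$$
Therefore, $(\sigma, w)$ is an eigenpair of $s(A)$.

• Conversely, let $(\lambda, w)$ be an eigenpair of $s(A)$ with $\lambda \ne 0$. Then, both $Av = \lambda u$ and $A^\top u = \lambda v$ hold, because
$$s(A) w = \begin{bmatrix} Av \\ A^\top u \end{bmatrix} = \lambda w = \begin{bmatrix} \lambda u \\ \lambda v \end{bmatrix}.$$
Neither $v$ nor $u$, which constitute the eigenvector $w$, can be a zero vector. If one of them is zero, the other also has to be zero from the singular relations. This implies $w = 0$, but the eigenvector $w$ is not a zero vector. Therefore, $v$ is an eigenvector of $A^\top A$ corresponding to the eigenvalue $\lambda^2$, and $|\lambda|$ is a singular value of $A$, according to Lemma 5.3.

• In addition, if $(\lambda, w)$ is an eigenpair of $s(A)$,
$$s(A) \bar{w} = \begin{bmatrix} Av \\ -A^\top u \end{bmatrix} = \begin{bmatrix} (-\lambda)(-u) \\ (-\lambda) v \end{bmatrix} = (-\lambda) \bar{w},$$
because $Av = \lambda u$ and $A^\top u = \lambda v$. Therefore, $(-\lambda, \bar{w})$ is also an eigenpair of $s(A)$.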


We obtain the following result summarizing these observations. 


Lemma 5.4 The symmetrization $s(A)$ has its eigenvalues in both signs; that is, if $\lambda$ is an eigenvalue of $s(A)$, then both $\pm|\lambda|$ are eigenvalues of $s(A)$. There exists a one-to-one correspondence between the singular values of $A$ and the positive eigenvalues of $s(A)$.


This lemma is useful when we convert any result on the eigenvalues of a symmetric matrix into that 


on the singular values of a matrix of arbitrary size. 
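Lemma 5.4 is easy to observe numerically. The sketch below assumes NumPy and uses `numpy.block` to assemble $s(A)$ as in (5.15); the matrix size, seed, and the threshold `1e-10` separating zero from non-zero eigenvalues are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(6)
m, n = 3, 5
A = rng.standard_normal((m, n))

# Symmetrization (5.15): an (m+n) x (m+n) symmetric matrix.
S = np.block([[np.zeros((m, m)), A],
              [A.T, np.zeros((n, n))]])

eig = np.linalg.eigvalsh(S)                           # ascending order
sing = np.linalg.svd(A, compute_uv=False)             # descending order

positive = np.sort(eig[eig > 1e-10])[::-1]            # sigma_1, ..., sigma_r
negative = np.sort(eig[eig < -1e-10])                 # -sigma_1, ..., -sigma_r
num_zero = int(np.sum(np.abs(eig) <= 1e-10))
```

The positive eigenvalues reproduce the singular values of `A`, the negative ones mirror them, and the remaining eigenvalues are zero.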


5.7 Low rank approximation and Eckart-Young-Mirsky Theorem

Suppose that an SVD of a matrix is given as
$$A = \sum_{i=1}^r \sigma_i u_i v_i^\top, \qquad \sigma_1 \ge \cdots \ge \sigma_r > 0 = \sigma_{r+1}.$$
Let $A_k$ be the matrix resulting from taking the summands that correspond to the largest $k$ singular values, as follows:
$$A_k = \sum_{i=1}^k \sigma_i u_i v_i^\top. \qquad (5.16)$$
Then, $A - A_k = \sum_{i=k+1}^r \sigma_i u_i v_i^\top$. Based on Fact 5.4, the ranks of $A$, $A_k$ and $A - A_k$ are $r$, $k$, and $r - k$, respectively.


Low Rank Approximation: Spectral Norm 


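Lemma 5.5 For any matrix $B$ of rank at most $k$, $\|A - A_k\|_2 \le \|A - B\|_2$.

Proof: Suppose that $B$ is an arbitrary matrix of rank at most $k$.

• Because $A - A_k = \sum_{i=k+1}^r \sigma_i u_i v_i^\top$ with $\sigma_{k+1} \ge \cdots \ge \sigma_r > 0$,
$$\|A - A_k\|_2 = \sigma_{k+1},$$
according to Fact 5.3, for $k < r$.

• The dimension of the null space of $B$ is at least $d - k$, since the rank of $B$ is at most $k$. Let $\{v_1, \dots, v_{k+1}\}$ be the right singular vectors of $A$ that correspond to the $k+1$ largest singular values. The dimension of $\operatorname{Null}(B) \cap \operatorname{span}\{v_1, \dots, v_{k+1}\}$ is at least 1, as $(d-k) + (k+1) > d$, and the intersection includes a unit vector $z$. Note that $Bz = 0$ from $z \in \operatorname{Null}(B)$ and $z = \sum_{j=1}^{k+1} \langle z, v_j \rangle v_j$ from $z \in \operatorname{span}\{v_1, \dots, v_{k+1}\}$. Then,
$$Az = \Big( \sum_{i=1}^r \sigma_i u_i v_i^\top \Big) \Big( \sum_{j=1}^{k+1} \langle z, v_j \rangle v_j \Big) = \sum_{i=1}^{k+1} \sigma_i \langle z, v_i \rangle u_i.$$
Thus,
$$
\begin{aligned}
\|A - B\|_2 &\ge |(A - B) z| && \text{(by the definition of spectral norm)} \\
&= |Az| && \text{(by } Bz = 0\text{)} \\
&= \Big( \sum_{i=1}^{k+1} \sigma_i^2 \langle z, v_i \rangle^2 \Big)^{1/2} && \text{(by } Az = \textstyle\sum_{i=1}^{k+1} \sigma_i \langle z, v_i \rangle u_i\text{)} \\
&\ge \sigma_{k+1} \Big( \sum_{i=1}^{k+1} \langle z, v_i \rangle^2 \Big)^{1/2} && \text{(by } \sigma_1 \ge \cdots \ge \sigma_k \ge \sigma_{k+1}\text{)} \\
&= \sigma_{k+1} && \text{(by } \textstyle\sum_{i=1}^{k+1} \langle z, v_i \rangle^2 = |z|^2 = 1\text{)} \\
&= \|A - A_k\|_2.
\end{aligned}
$$
Therefore, $\|A - A_k\|_2 \le \|A - B\|_2$ holds for any matrix $B$ of rank at most $k$. ∎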


Low Rank Approximation: Frobenius Norm 


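Lemma 5.6 For any matrix $B$ of rank at most $k$, $\|A - A_k\|_F \le \|A - B\|_F$.

Proof: Assume that $\|A - B\|_F$ is minimized by the $n \times d$ matrix $B$ of rank at most $k$. Let $b_1, \dots, b_n$ be the rows of this matrix $B$. We further assume that the projection of the $i$-th row $a_i$ of $A$ onto $\operatorname{span}\{b_1, \dots, b_n\}$ is not $b_i$. That is, $P_{\operatorname{span}\{b_1, \dots, b_n\}}(a_i) \ne b_i$. We create $B'$ from $B$ by replacing $b_i$ with $P_{\operatorname{span}\{b_1, \dots, b_n\}}(a_i)$. Then, $\|A - B\|_F > \|A - B'\|_F$, since $|a_i - b_i| > |a_i - P_{\operatorname{span}\{b_1, \dots, b_n\}}(a_i)|$ (due to the distance minimizing property of orthogonal projection) and $\|A - B\|_F^2 = \sum_{i=1}^n |a_i - b_i|^2$. The rank of $B'$ is however not greater than that of $B$, since $P_{\operatorname{span}\{b_1, \dots, b_n\}}(a_i)$ is a linear combination of the rows of $B$. This contradicts the assumption of minimal Frobenius norm, and thus $b_i = P_{\operatorname{span}\{b_1, \dots, b_n\}}(a_i)$. That is,
$$\min_{B : \operatorname{rank}(B) \le k} \|A - B\|_F^2 = \min_{B : \operatorname{rank}(B) \le k} \sum_{i=1}^n \big| a_i - P_{\operatorname{span}\{b_1, \dots, b_n\}}(a_i) \big|^2.$$
Finding $\operatorname{span}\{b_1, \dots, b_n\}$ among matrices of rank at most $k$ is equivalent to finding a subspace $W$ of dimension up to $k$, i.e.,
$$\min_{B : \operatorname{rank}(B) \le k} \sum_{i=1}^n \big| a_i - P_{\operatorname{span}\{b_1, \dots, b_n\}}(a_i) \big|^2 = \min_{W : \dim W \le k} \sum_{i=1}^n \big| a_i - P_W(a_i) \big|^2.$$
Furthermore, because we can represent a subspace with a basis,
$$\min_{W : \dim W \le k} \sum_{i=1}^n \big| a_i - P_W(a_i) \big|^2 = \min_{\substack{v_1, \dots, v_k \\ \text{orthonormal}}} \sum_{i=1}^n \Big| a_i - \sum_{j=1}^k \langle a_i, v_j \rangle v_j \Big|^2.$$
By putting all these equations together, we get
$$
\begin{aligned}
\min_{B : \operatorname{rank}(B) \le k} \|A - B\|_F^2 &= \min_{\substack{v_1, \dots, v_k \\ \text{orthonormal}}} \sum_{i=1}^n \Big| a_i - \sum_{j=1}^k \langle a_i, v_j \rangle v_j \Big|^2 \\
&= \min_{\substack{v_1, \dots, v_k \\ \text{orthonormal}}} \sum_{i=1}^n \Big( |a_i|^2 - \sum_{j=1}^k \langle a_i, v_j \rangle^2 \Big) \\
&= \sum_{i=1}^n |a_i|^2 - \max_{\substack{v_1, \dots, v_k \\ \text{orthonormal}}} \sum_{i=1}^n \sum_{j=1}^k \langle a_i, v_j \rangle^2 \\
&= \|A\|_F^2 - \max_{\substack{v_1, \dots, v_k \\ \text{orthonormal}}} \big( |A v_1|^2 + \cdots + |A v_k|^2 \big) \\
&= (\sigma_1^2 + \cdots + \sigma_r^2) - (\sigma_1^2 + \cdots + \sigma_k^2) && \text{by (5.6)} \\
&= \sigma_{k+1}^2 + \cdots + \sigma_r^2 \\
&= \|A - A_k\|_F^2 && \text{by (5.6)}.
\end{aligned}
$$
Therefore, $\|A - A_k\|_F \le \|A - B\|_F$ for any matrix $B$ of rank at most $k$.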


Consider the $n \times d$ matrices $A$ and $A_k$ in (5.16). By Lemma 5.5,
$$\|A - A_k\|_2 \le \|A - B\|_2$$
holds for any matrix $B$ of rank at most $k$. In addition, thanks to Fact 5.3,
$$\|A - A_k\|_2 = \Big\| \sum_{i=k+1}^r \sigma_i u_i v_i^\top \Big\|_2 = \sigma_{k+1}$$
holds. We combine these results to arrive at the following Eckart-Young-Mirsky Theorem.

Theorem 5.3 (Eckart-Young-Mirsky Theorem) For any matrix $A$ and its $i$-th singular value $\sigma_i$,
$$\min_{B : \operatorname{rank} B \le k} \|A - B\|_2 = \sigma_{k+1}. \qquad (5.17)$$
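The truncated sum $A_k$ and its approximation errors can be computed in a few lines. The following is a sketch assuming NumPy; the random rank-$k$ competitors are only a spot check for illustration and do not constitute a proof of optimality.

```python
import numpy as np

rng = np.random.default_rng(7)
A = rng.standard_normal((8, 6))
k = 2

U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]              # (5.16)

err_spectral = np.linalg.norm(A - A_k, 2)             # expected: sigma_{k+1}
err_frobenius = np.linalg.norm(A - A_k, "fro")        # expected: sqrt(sum of trailing sigma_i^2)

# Spot check against random matrices of rank at most k.
candidates = [rng.standard_normal((8, k)) @ rng.standard_normal((k, 6))
              for _ in range(200)]
never_beaten = all(
    np.linalg.norm(A - B, 2) >= err_spectral - 1e-12
    and np.linalg.norm(A - B, "fro") >= err_frobenius - 1e-12
    for B in candidates
)
```

Since NumPy indexes from zero, `s[k]` is the singular value $\sigma_{k+1}$ appearing in (5.17).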


5.8 Pseudoinverse 


We used SVD to express the inverse of a square invertible matrix in Fact 5.6. In this section, we define the pseudoinverse, or Moore-Penrose generalized inverse, in a similar form, for general matrices including non-square matrices and non-invertible matrices.

Definition 5.2 (Pseudoinverse) Let $A$ be an $n \times d$ matrix of rank $r$ and $A = U \Sigma V^\top$ be the compact^a singular value decomposition of $A$. Then, we call the following $d \times n$ matrix the pseudoinverse of $A$:
$$A^+ = V \Sigma^{-1} U^\top, \qquad (5.18)$$
where $\Sigma^{-1} = \operatorname{diag}\big( \tfrac{1}{\sigma_1}, \dots, \tfrac{1}{\sigma_r} \big)$ with the non-zero singular values $\sigma_1, \dots, \sigma_r$ of $A$.

^a SVD of an $n \times d$ matrix $A$ is not unique. When $\operatorname{rank}(A) = r$, we say it is a compact SVD if the columns of $U$ and $V$ are respectively $r$ left and right singular vectors, and $\Sigma$ is a diagonal matrix with $r$ singular values on its diagonal.


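Both $U$ and $V$ consist of orthonormal columns, but they may not be square, and hence may not be orthogonal matrices. In other words, we can rely only on the fact that $U^\top U = I_r = V^\top V$. From these equalities, we know that $A A^+$ and $A^+ A$ are symmetric matrices, since
$$A A^+ = U \Sigma V^\top V \Sigma^{-1} U^\top = U U^\top \quad \text{and} \quad A^+ A = V \Sigma^{-1} U^\top U \Sigma V^\top = V V^\top.$$
Furthermore, we can also show that
$$A A^+ A = U U^\top U \Sigma V^\top = U \Sigma V^\top = A \quad \text{and} \quad A^+ A A^+ = V V^\top V \Sigma^{-1} U^\top = V \Sigma^{-1} U^\top = A^+.$$
We call these properties, namely the symmetry of $A A^+$ and $A^+ A$ together with $A A^+ A = A$ and $A^+ A A^+ = A^+$, collectively the Penrose identities.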


Definition 5.3 (Penrose identities) An $n \times d$ matrix $A$ and a $d \times n$ matrix $B$ satisfy the Penrose identities if $A$ and $B$ satisfy the following identities:
(a) $(AB)^\top = AB$ and $(BA)^\top = BA$;
(b) $ABA = A$;
(c) $BAB = B$.
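The definition (5.18) and the Penrose identities can be verified on a rank-deficient matrix. This is a minimal sketch assuming NumPy; the threshold `1e-10` used to pick the non-zero singular values is our assumption, and `B` is simply a name for the candidate pseudoinverse.

```python
import numpy as np

rng = np.random.default_rng(8)
A = rng.standard_normal((5, 2)) @ rng.standard_normal((2, 4))   # 5 x 4, rank 2

# Compact SVD: keep only the numerically non-zero singular values.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
r = int(np.sum(s > 1e-10 * s[0]))
U, s, V = U[:, :r], s[:r], Vt[:r].T

B = V @ np.diag(1.0 / s) @ U.T                                  # (5.18)

penrose_a = np.allclose((A @ B).T, A @ B) and np.allclose((B @ A).T, B @ A)
penrose_b = np.allclose(A @ B @ A, A)
penrose_c = np.allclose(B @ A @ B, B)
```

NumPy's `numpy.linalg.pinv` should return the same matrix up to rounding.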




There is a unique matrix $B$ that satisfies the Penrose identities with a given arbitrary matrix $A$.

Fact 5.8 There is a unique matrix satisfying the Penrose identities.

Proof: Assume there are two matrices, $B$ and $C$, that satisfy the Penrose identities with a given $A$. Using the three identities, we get
$$
\begin{aligned}
B &= BAB = B(AB)^\top = B B^\top A^\top = B B^\top (ACA)^\top = B B^\top A^\top (AC)^\top \\
&= B (AB)^\top AC = BABAC = (BAB)AC = BAC \\
&= (BA)^\top C = A^\top B^\top C = (ACA)^\top B^\top C = (CA)^\top A^\top B^\top C \\
&= (CA)^\top (BA)^\top C = CABAC = C(ABA)C = CAC = C.
\end{aligned}
$$
That is, there is only one matrix that satisfies all three Penrose identities. ∎
From this property, we can show that the Penrose identities are equivalent to the definition of pseudoinverse.

Fact 5.9 A $d \times n$ matrix is a pseudoinverse of $A$ if and only if the matrix satisfies the Penrose identities.

Proof: We already showed the "only if" direction above, which tells us that $A^+$ satisfies the Penrose identities. According to Fact 5.8, there is only one matrix that satisfies the Penrose identities, and thereby this matrix is the pseudoinverse. ∎

By combining the above two facts, we conclude that a pseudoinverse is unique.

Theorem 5.4 A pseudoinverse of a matrix is unique.

Proof: We can show this based on Facts 5.8 and 5.9. ∎


As an example, let us compute the pseudoinverse as well as a low-rank approximation of the following 


matrix given as a sum of rank-one matrices. 


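Example 5.4 [Example 5.3 revisited] Consider the following 4 × 4 matrix $A$:
$$A = \begin{bmatrix} 0 \\ 0 \\ 1 \\ 0 \end{bmatrix} \begin{bmatrix} 0 & 0 & 2 & 0 \end{bmatrix} - 3 \begin{bmatrix} 1 \\ 1 \\ 0 \\ 0 \end{bmatrix} \begin{bmatrix} 1 & 1 & 0 & 0 \end{bmatrix} + 2 \begin{bmatrix} 1 \\ -1 \\ 0 \\ 0 \end{bmatrix} \begin{bmatrix} 1 & -1 & 0 & 0 \end{bmatrix}.$$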

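1. We already computed the compact SVD of this matrix in Example 5.3:
$$
\begin{aligned}
A &= 6 \begin{bmatrix} -\frac{1}{\sqrt{2}} \\ -\frac{1}{\sqrt{2}} \\ 0 \\ 0 \end{bmatrix} \begin{bmatrix} \frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}} & 0 & 0 \end{bmatrix} + 4 \begin{bmatrix} \frac{1}{\sqrt{2}} \\ -\frac{1}{\sqrt{2}} \\ 0 \\ 0 \end{bmatrix} \begin{bmatrix} \frac{1}{\sqrt{2}} & -\frac{1}{\sqrt{2}} & 0 & 0 \end{bmatrix} + 2 \begin{bmatrix} 0 \\ 0 \\ 1 \\ 0 \end{bmatrix} \begin{bmatrix} 0 & 0 & 1 & 0 \end{bmatrix} \\
&= \sigma_1 u_1 v_1^\top + \sigma_2 u_2 v_2^\top + \sigma_3 u_3 v_3^\top.
\end{aligned}
$$
Therefore, its pseudoinverse is
$$
\begin{aligned}
A^+ &= \frac{1}{\sigma_1} v_1 u_1^\top + \frac{1}{\sigma_2} v_2 u_2^\top + \frac{1}{\sigma_3} v_3 u_3^\top \\
&= \frac{1}{6} \begin{bmatrix} \frac{1}{\sqrt{2}} \\ \frac{1}{\sqrt{2}} \\ 0 \\ 0 \end{bmatrix} \begin{bmatrix} -\frac{1}{\sqrt{2}} & -\frac{1}{\sqrt{2}} & 0 & 0 \end{bmatrix} + \frac{1}{4} \begin{bmatrix} \frac{1}{\sqrt{2}} \\ -\frac{1}{\sqrt{2}} \\ 0 \\ 0 \end{bmatrix} \begin{bmatrix} \frac{1}{\sqrt{2}} & -\frac{1}{\sqrt{2}} & 0 & 0 \end{bmatrix} + \frac{1}{2} \begin{bmatrix} 0 \\ 0 \\ 1 \\ 0 \end{bmatrix} \begin{bmatrix} 0 & 0 & 1 & 0 \end{bmatrix} \\
&= V \Sigma^{-1} U^\top = \begin{bmatrix} \frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}} & 0 \\ \frac{1}{\sqrt{2}} & -\frac{1}{\sqrt{2}} & 0 \\ 0 & 0 & 1 \\ 0 & 0 & 0 \end{bmatrix} \begin{bmatrix} 1/6 & 0 & 0 \\ 0 & 1/4 & 0 \\ 0 & 0 & 1/2 \end{bmatrix} \begin{bmatrix} -\frac{1}{\sqrt{2}} & -\frac{1}{\sqrt{2}} & 0 & 0 \\ \frac{1}{\sqrt{2}} & -\frac{1}{\sqrt{2}} & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix}.
\end{aligned}
$$

2. By Lemma 5.5,
$$B = A_2 = \sigma_1 u_1 v_1^\top + \sigma_2 u_2 v_2^\top = 6 \begin{bmatrix} -\frac{1}{\sqrt{2}} \\ -\frac{1}{\sqrt{2}} \\ 0 \\ 0 \end{bmatrix} \begin{bmatrix} \frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}} & 0 & 0 \end{bmatrix} + 4 \begin{bmatrix} \frac{1}{\sqrt{2}} \\ -\frac{1}{\sqrt{2}} \\ 0 \\ 0 \end{bmatrix} \begin{bmatrix} \frac{1}{\sqrt{2}} & -\frac{1}{\sqrt{2}} & 0 & 0 \end{bmatrix}$$
minimizes $\|A - B\|_2$ among 4 × 4 matrices of rank at most 2.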
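Both parts of Example 5.4 can be checked numerically from the singular triplets. The sketch assumes NumPy and relabels the triplets so that $\sigma_1 = 6$, $\sigma_2 = 4$, $\sigma_3 = 2$; the variable names are ours.

```python
import numpy as np

v1 = np.array([1.0, 1.0, 0.0, 0.0]) / np.sqrt(2)
v2 = np.array([1.0, -1.0, 0.0, 0.0]) / np.sqrt(2)
v3 = np.array([0.0, 0.0, 1.0, 0.0])
u1, u2, u3 = -v1, v2, v3
triplets = [(6.0, v1, u1), (4.0, v2, u2), (2.0, v3, u3)]

A = sum(s * np.outer(u, v) for s, v, u in triplets)
A_pinv = sum(np.outer(v, u) / s for s, v, u in triplets)   # sum of (1/sigma_i) v_i u_i^T
A_2 = sum(s * np.outer(u, v) for s, v, u in triplets[:2])  # best rank-2 approximation

error_2 = np.linalg.norm(A - A_2, 2)                       # equals sigma_3
```

The approximation error in the spectral norm equals the first discarded singular value, in line with (5.17).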


Another example of pseudoinverse is a left- or right-inverse of a non-square full-rank matrix.

Fact 5.10 Let the rank of an $n \times d$ matrix $A$ be $d$. Then, $A^+ = (A^\top A)^{-1} A^\top$, and $A^+$ is a left-inverse of $A$. If $n > d$, then $A^+$ is not a right-inverse of $A$. For the case of $n = d$, $A^+$ is the inverse of $A$.

Proof: $(A^\top A)^{-1} A^\top$ is the pseudoinverse of $A$ by Fact 5.9 since it satisfies the Penrose identities with $A$. Since $A^+ A = (A^\top A)^{-1} A^\top A = I_d$, $A^+$ is a left-inverse of $A$. However, $A$ has no right-inverse if $n > d$, because the rank of $AX$ is at most $d < n$ for any $X$, and $A^+$ is therefore not a right-inverse of $A$ in this case. If $n = d$, then $A$ is invertible since $\operatorname{rank} A = n = d$, and
$$A^+ = (A^\top A)^{-1} A^\top = A^{-1} (A^\top)^{-1} A^\top = A^{-1}.$$ ∎


Because the following holds for the pseudoinverse,
$$A^\top (A A^+ - I) = V \Sigma U^\top (U U^\top - I) = 0,$$
we can show for an arbitrary pair of vectors, $x$ and $b$, that
$$
\begin{aligned}
|Ax - b|^2 &= |Ax - A A^+ b + A A^+ b - b|^2 \\
&= |A(x - A^+ b) + (A A^+ - I) b|^2 \\
&= |A(x - A^+ b)|^2 + 2 (x - A^+ b)^\top A^\top (A A^+ - I) b + |(A A^+ - I) b|^2 \\
&= |A(x - A^+ b)|^2 + 0 + |A A^+ b - b|^2 \\
&\ge |A A^+ b - b|^2.
\end{aligned}
$$
In short, for an arbitrary matrix $A$ and an arbitrary vector $b$,
$$|Ax - b| \ge |A A^+ b - b| \quad \text{for all } x \in \mathbb{R}^d. \qquad (5.19)$$


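With $A^{+\top} = (A^+)^\top$,
$$A (A^\top A^{+\top} - I) = U \Sigma V^\top (V V^\top - I) = 0,$$
because $A^\top A^{+\top} = (A^+ A)^\top = V V^\top$. From this, we obtain the following inequality:
$$
\begin{aligned}
|A^\top y - c|^2 &= |A^\top (y - A^{+\top} c) + (A^\top A^{+\top} c - c)|^2 \\
&= |A^\top (y - A^{+\top} c)|^2 + 2 (y - A^{+\top} c)^\top A (A^\top A^{+\top} - I) c + |A^\top A^{+\top} c - c|^2 \\
&= |A^\top (y - A^{+\top} c)|^2 + 0 + |A^\top A^{+\top} c - c|^2 \\
&\ge |A^\top A^{+\top} c - c|^2,
\end{aligned}
$$
given appropriately-sized $y$ and $c$. Therefore, for any given matrix $A$ and vector $c$,
$$|y^\top A - c^\top| \ge |c^\top A^+ A - c^\top| \quad \text{for all } y \in \mathbb{R}^n. \qquad (5.20)$$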


Using this result, we arrive at the following characterization of pseudoinverse. 


Theorem 5.5 The pseudoinverse of $A$ is a matrix $X$ that minimizes $\|AX - I_n\|_F$, that is,
$$A^+ = \operatorname*{argmin}_{X : d \times n \text{ matrix}} \|AX - I_n\|_F = \operatorname*{argmin}_{Y : d \times n \text{ matrix}} \|YA - I_d\|_F. \qquad (5.21)$$

Proof: Let $e_j$ be the $j$-th standard basis vector in $\mathbb{R}^n$. Note that $\|B\|_F^2 = \sum_{j=1}^n |B e_j|^2$ for any matrix $B$ with $n$ columns. If we replace $x$ in (5.19) with the $j$-th column of an arbitrary matrix $X$, i.e., $X e_j$, and set $b$ to be $e_j$, we get
$$|A X e_j - e_j| \ge |A A^+ e_j - e_j|.$$
From this, we can establish the lower bound of $\|AX - I_n\|_F^2$ as
$$\|AX - I_n\|_F^2 = \sum_{j=1}^n |A X e_j - e_j|^2 \ge \sum_{j=1}^n |A A^+ e_j - e_j|^2 = \|A A^+ - I_n\|_F^2.$$
Let $e_i$ be the $i$-th standard basis vector in $\mathbb{R}^d$. Note that $\|B\|_F^2 = \sum_{i=1}^d |e_i^\top B|^2$ for any matrix $B$ with $d$ rows. Similarly, from (5.20), we get
$$|e_i^\top Y A - e_i^\top| \ge |e_i^\top A^+ A - e_i^\top|,$$
and subsequently
$$\|YA - I_d\|_F^2 = \sum_{i=1}^d |e_i^\top Y A - e_i^\top|^2 \ge \sum_{i=1}^d |e_i^\top A^+ A - e_i^\top|^2 = \|A^+ A - I_d\|_F^2.$$

5.8.1 Generalized Projection and Least Squares 


Let us consider a projection or a least squares problem in a more general setup. In this setup, the matrix $A = [a_1 \,|\, \cdots \,|\, a_d]$ may not be a full-rank matrix. In order to approximately solve $Ax = b$, we need to either project $b$ onto the column space $\operatorname{Col}(A)$, or find $x$ that minimizes $|Ax - b|$. From the inequality in (5.19), we know that $x = A^+ b$ minimizes $|Ax - b|$. Because $A$ and $A^+$ satisfy the Penrose identities,
$$(b - A A^+ b)^\top A A^+ b = b^\top (I - A A^+) A A^+ b = b^\top (A A^+ - A A^+ A A^+) b = b^\top (A A^+ - A A^+) b = 0,$$
implying that $(b - A A^+ b)$ and $A A^+ b$ are orthogonal to each other. In summary, $A^+ b$ is the solution to the least squares problem, and at the same time, we can express the orthogonal projection of $b$ onto the column space, $\operatorname{Col}(A) = \operatorname{span}\{a_1, \dots, a_d\}$, as
$$P_{\operatorname{Col}(A)}(b) = A A^+ b, \qquad (5.22)$$
where $A A^+$ is the matrix corresponding to the orthogonal projection onto $\operatorname{Col}(A)$.
If the rank of $A$ is $d$, the least squares solution is $(A^\top A)^{-1} A^\top b$, and the orthogonal projection is $A (A^\top A)^{-1} A^\top b$. These coincide with the earlier results, as implied by Fact 5.10.


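Let us summarize various orthogonal projections derived for an $n \times d$ matrix $A = [a_1 \,|\, \cdots \,|\, a_d]$:
• Without any particular constraint on $A$,
$$P_{\operatorname{Col}(A)}(x) = A A^+ x \quad \text{and} \quad A^+;$$
• If the columns of $A$ are linearly independent, i.e. $\operatorname{rank} A = d$,
$$P_{\operatorname{Col}(A)}(x) = A (A^\top A)^{-1} A^\top x \quad \text{and} \quad A^+ = (A^\top A)^{-1} A^\top;$$
• If the columns of $A$, $a_1, \dots, a_d$, are orthonormal,
$$P_{\operatorname{Col}(A)}(x) = A A^\top x \quad \text{and} \quad A^+ = A^\top;$$
• If there is only one column, $v$, in $A$,
$$P_{\operatorname{Col}(A)}(x) = \frac{1}{v^\top v} v v^\top x \quad \text{and} \quad A^+ = \frac{1}{v^\top v} v^\top.$$

Each of these projections is a special case of the one above.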
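The least squares property (5.19) and the projection formula (5.22) can be illustrated on a rank-deficient matrix, where $(A^\top A)^{-1}$ does not exist. This is a sketch assuming NumPy; the tolerance `rcond=1e-10` and the random test data are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(9)
A = rng.standard_normal((6, 2)) @ rng.standard_normal((2, 4))   # 6 x 4, rank 2
b = rng.standard_normal(6)

A_pinv = np.linalg.pinv(A, rcond=1e-10)
x_star = A_pinv @ b                    # least squares solution A^+ b
projection = A @ x_star                # (5.22): projection of b onto Col(A)
residual = b - projection

# NumPy's least squares routine returns the minimum-norm solution as well.
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=1e-10)

# Any other x cannot give a smaller residual norm.
x_other = rng.standard_normal(4)
other_is_no_better = np.linalg.norm(A @ x_other - b) >= np.linalg.norm(residual)
```

The residual is orthogonal to every column of `A`, which is the orthogonality derived just before (5.22).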




5.9 How to Obtain SVD Meaningfully 


When a matrix is provided to us in a form similar to (5.7), we can obtain singular values and vectors by Fact 5.2. For instance, if the rank of the matrix is one, we can readily use Fact 5.2 to obtain a singular triplet. If this is not the case, it is usual to solve the symmetric eigenvalue problem on the smaller of $A A^\top$ and $A^\top A$. The eigenvectors then serve as left or right singular vectors, respectively, and the square roots of the eigenvalues are singular values. We can compute the remaining singular vectors (either right or left) using Lemma 5.2.


Centering Data 


When we look for the optimal $k$-dimensional subspace to represent data, we must first decide whether we are looking for a $k$-dimensional subspace or a $k$-dimensional affine space.² In the latter case, we first subtract the mean vector from each data point so that the centroid of the data set is located at the origin and then perform SVD.


Fact 5.11 The k-dimensional affine space that minimizes the sum of squared perpendicular distances to 


the data points must pass through the centroid of the points. 


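Proof: Let $S = \{\mathbf{v}_0 + \sum_{j=1}^{k} c_j\mathbf{v}_j : c_1,\dots,c_k \in \mathbb{R}\}$ be an affine space, where $\mathbf{v}_1,\dots,\mathbf{v}_k$ are orthonormal vectors. We use $\mathbf{a}_1,\dots,\mathbf{a}_n$ to refer to the data points. $\mathbf{v}_0$ is the point in $S$ that is closest to the origin $\mathbf{0}$ and is orthogonal to every $\mathbf{v}_j$. The closest point to $\mathbf{a}_i$ in $S$ is the projection of $\mathbf{a}_i$ onto $S$, and we represent it as $\mathbf{v}_0 + \sum_{j=1}^{k} c_j^i\mathbf{v}_j$. The vector that represents the difference between $\mathbf{a}_i$ and this projected vector is orthogonal to $S$. That is, $\langle \mathbf{a}_i - \mathbf{v}_0 - \sum_{j=1}^{k} c_j^i\mathbf{v}_j, \mathbf{v}_\ell\rangle = 0$ for all $\ell$. By rearranging terms, we get $c_\ell^i = \langle \mathbf{a}_i - \mathbf{v}_0, \mathbf{v}_\ell\rangle = \langle \mathbf{a}_i, \mathbf{v}_\ell\rangle$. The sum of squared distances from the data points to $S$ is then

\[
\begin{aligned}
\sum_{i=1}^{n} \mathrm{dist}(\mathbf{a}_i, S)^2
&= \sum_{i=1}^{n} \Big|\mathbf{a}_i - \mathbf{v}_0 - \sum_{j=1}^{k} c_j^i\mathbf{v}_j\Big|^2 \\
&= \sum_{i=1}^{n} \Big\{ |\mathbf{a}_i - \mathbf{v}_0|^2 - 2\Big\langle \mathbf{a}_i - \mathbf{v}_0, \sum_{j=1}^{k} c_j^i\mathbf{v}_j\Big\rangle + \Big|\sum_{j=1}^{k} c_j^i\mathbf{v}_j\Big|^2 \Big\} \\
&= \sum_{i=1}^{n} \Big\{ |\mathbf{a}_i - \mathbf{v}_0|^2 - 2\sum_{j=1}^{k} c_j^i\langle \mathbf{a}_i - \mathbf{v}_0, \mathbf{v}_j\rangle + \sum_{j=1}^{k} (c_j^i)^2 \Big\} \\
&= \sum_{i=1}^{n} \Big\{ |\mathbf{a}_i - \mathbf{v}_0|^2 - \sum_{j=1}^{k} (c_j^i)^2 \Big\} \\
&= \sum_{i=1}^{n} \Big\{ |\mathbf{a}_i|^2 - 2\langle \mathbf{v}_0, \mathbf{a}_i\rangle + |\mathbf{v}_0|^2 - \sum_{j=1}^{k} \langle \mathbf{a}_i, \mathbf{v}_j\rangle^2 \Big\} \\
&= n|\mathbf{v}_0|^2 - 2\Big\langle \mathbf{v}_0, \sum_{i=1}^{n}\mathbf{a}_i\Big\rangle + \sum_{i=1}^{n}\Big(|\mathbf{a}_i|^2 - \sum_{j=1}^{k}\langle \mathbf{a}_i, \mathbf{v}_j\rangle^2\Big) \\
&= n\Big|\mathbf{v}_0 - \frac{1}{n}\sum_{i=1}^{n}\mathbf{a}_i\Big|^2 - \frac{1}{n}\Big|\sum_{i=1}^{n}\mathbf{a}_i\Big|^2 + \sum_{i=1}^{n}\Big(|\mathbf{a}_i|^2 - \sum_{j=1}^{k}\langle \mathbf{a}_i, \mathbf{v}_j\rangle^2\Big) \\
&\ge -\frac{1}{n}\Big|\sum_{i=1}^{n}\mathbf{a}_i\Big|^2 + \sum_{i=1}^{n}\Big(|\mathbf{a}_i|^2 - \sum_{j=1}^{k}\langle \mathbf{a}_i, \mathbf{v}_j\rangle^2\Big).
\end{aligned}
\]

It is thus minimized when $\mathbf{v}_0 = \frac{1}{n}\sum_{i=1}^{n}\mathbf{a}_i$, which is equivalent to saying that it is minimized when the affine space $S$ passes through the centroid of the data points. □

² An affine space is constructed by translating a linear space off the origin and is often expressed as $\{\mathbf{v}_0 + x_1\mathbf{v}_1 + \cdots + x_k\mathbf{v}_k : (x_1,\dots,x_k) \in \mathbb{R}^k\}$. An affine space is not closed under general linear combinations, but it is closed under affine combinations. That is, if $\mathbf{v}_1$ and $\mathbf{v}_2$ are included in an affine space, $\lambda\mathbf{v}_1 + (1-\lambda)\mathbf{v}_2$ is also within the same space for an arbitrary real-valued $\lambda$.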
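A small numerical check of Fact 5.11 may help. The sketch below fixes a direction and compares affine lines through different base points; the synthetic data, the helper `sum_sq_dist`, and the particular offsets are all assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
points = rng.standard_normal((200, 3)) + np.array([5.0, -2.0, 1.0])   # rows are data points a_i

def sum_sq_dist(points, v0, V):
    """Sum of squared distances from the rows of `points` to v0 + span(columns of V)."""
    centered = points - v0
    residual = centered - centered @ V @ V.T      # remove the component inside span(V)
    return float(np.sum(residual ** 2))

centroid = points.mean(axis=0)

# Directions of the affine space: leading right singular vectors of the centered data.
k = 1
_, _, Vt = np.linalg.svd(points - centroid, full_matrices=False)
V = Vt[:k].T

through_centroid = sum_sq_dist(points, centroid, V)
through_origin = sum_sq_dist(points, np.zeros(3), V)
shifted = sum_sq_dist(points, centroid + np.array([0.3, 0.3, 0.3]), V)
```

With the directions held fixed, moving the base point away from the centroid can only increase the total squared distance, which is the content of the completed square in the proof.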


5.10 Application to Statistics: Principal Components Analysis (PCA) from SVD


A principal component of a random vector³ is a linear combination of the random variables that are the entries of the random vector, where each principal component explains part of the variability of the random vector. A principal component is therefore also a random variable, and a better principal component explains more variability of the random vector. For a given random vector $X$, a principal component is characterized by a $\mathbf{v} = (v_1,\dots,v_d)^\top \in \mathbb{R}^d$, a weight vector of the linear combination, and can then be represented as $\mathbf{v}^\top X = \sum_{i=1}^{d} v_iX_i$. We adopt the variance of a random variable as the measure of its variability. Then, finding the best principal component is equivalent to finding $\mathbf{v}$ that maximizes the variance of $\mathbf{v}^\top X$ among unit vectors, i.e., $|\mathbf{v}| = 1$. We call this the best principal component or the first principal component. Assume the mean of $X$ is $\mathbf{0}$ without loss of generality.⁴ Then, the mean of $\mathbf{v}^\top X$ is zero, and its variance is

\[ \mathrm{Var}(\mathbf{v}^\top X) = \mathbb{E}[(\mathbf{v}^\top X)^2] = \mathbb{E}[\mathbf{v}^\top X(\mathbf{v}^\top X)^\top] = \mathbb{E}[\mathbf{v}^\top XX^\top\mathbf{v}] = \mathbf{v}^\top\mathbb{E}[XX^\top]\mathbf{v} = \mathbf{v}^\top\Sigma\mathbf{v}, \]

where $\Sigma = \mathbb{E}[XX^\top]$ is the covariance matrix of $X$.


• We can obtain the first principal component by solving

\[ \operatorname*{argmax}_{|\mathbf{v}|=1} \mathrm{Var}(\mathbf{v}^\top X) = \operatorname*{argmax}_{|\mathbf{v}|=1} \mathbf{v}^\top\Sigma\mathbf{v}. \]

Since $\Sigma$ is not given in practice, and we are only given a set of iid⁵ samples, $X_1, X_2, \dots, X_n$, we use a sample covariance $\hat\Sigma$ estimated from these samples. Let $A$ be the data matrix of which the $i$-th row corresponds to the $i$-th data point $X_i$. Then, we can write an estimate of the sample covariance as⁶


³ We explain random variables in Appendix D.
⁴ This can be satisfied easily in practice by subtracting the sample mean from all the data points.
⁵ Independent and identically distributed.
⁶ Check yourself how x holds.




where $\mathbf{x}_i$ is an observation of $X_i$. Since $\mathbf{v}^\top A^\top A\mathbf{v} = |A\mathbf{v}|^2$, we can rewrite the optimization problem to find the first principal component into

\[ \operatorname*{argmax}_{|\mathbf{v}|=1} \mathbf{v}^\top\hat\Sigma\mathbf{v} = \operatorname*{argmax}_{|\mathbf{v}|=1} |A\mathbf{v}|. \]

In other words, the first principal component coincides with the first right singular vector $\mathbf{v}_1$ of $A$. Furthermore, the variance explained by the first principal component is proportional to the square of the first singular value, $\sigma_1^2$.
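We define the second principal component, characterized by $\mathbf{v}_2$, as a principal component that is not correlated⁷ with the first principal component $\mathbf{v}_1^\top X$, and that maximizes the variance $\mathrm{Var}(\mathbf{v}^\top X)$. In statistical terminology, we are solving

\[ \mathbf{v}_2 = \operatorname*{argmax}_{|\mathbf{v}|=1,\ \mathrm{Cov}(\mathbf{v}^\top X,\,\mathbf{v}_1^\top X)=0} \mathrm{Var}(\mathbf{v}^\top X). \]

The covariance constraint can be rewritten as $\mathrm{Cov}(\mathbf{v}^\top X, \mathbf{v}_1^\top X) = \mathbf{v}^\top\Sigma\mathbf{v}_1 = 0$. Using the sample covariance, we further rewrite it into

\[ \mathbf{v}^\top\hat\Sigma\mathbf{v}_1 = 0 \iff \mathbf{v}^\top A^\top A\mathbf{v}_1 = \langle A\mathbf{v}, A\mathbf{v}_1\rangle = 0. \]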


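Let us write $A$ using SVD as

\[ A = \sum_{j=1}^{d} \sigma_j\mathbf{u}_j\mathbf{v}_j^\top \]

with $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_r > 0 = \sigma_{r+1} = \cdots = \sigma_d$. We can express an arbitrary $\mathbf{v} \in \mathbb{R}^d$ as $\mathbf{v} = \sum_{j=1}^{d} \alpha_j\mathbf{v}_j$ with the coefficient vector $\boldsymbol{\alpha} = (\alpha_1,\dots,\alpha_d)^\top$. Since $A\mathbf{v}_j = \sigma_j\mathbf{u}_j$ for all $j$ and the $\mathbf{u}_j$'s are orthogonal, $\langle A\mathbf{v}, A\mathbf{v}_1\rangle = \langle \sum_{j=1}^{d} \alpha_j\sigma_j\mathbf{u}_j, \sigma_1\mathbf{u}_1\rangle = \alpha_1\sigma_1^2$. If $\langle A\mathbf{v}, A\mathbf{v}_1\rangle = 0$, we get $\alpha_1 = 0$. In other words, $\langle \mathbf{v}, \mathbf{v}_1\rangle = 0$, that is, the orthogonality between $\mathbf{v}$ and $\mathbf{v}_1$, is the necessary and sufficient condition for uncorrelatedness of $\mathbf{v}_1^\top X$ and $\mathbf{v}^\top X$. Mathematically,

\[ \mathbf{v}^\top\hat\Sigma\mathbf{v}_1 = 0 \iff \langle \mathbf{v}, \mathbf{v}_1\rangle = 0, \]

and we can obtain the second principal component by solving

\[ \mathbf{v}_2 = \operatorname*{argmax}_{|\mathbf{v}|=1,\ \langle \mathbf{v},\mathbf{v}_1\rangle=0} |A\mathbf{v}|. \]

That is, the second right singular vector $\mathbf{v}_2$ is the second principal component, and the variance explained by the second principal component is proportional to $\sigma_2^2$, with the same factor as for the first.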


We can repeat this procedure for rank A = r times to compute the r principal components. We could 
find the (r + 1)-th principal component, but this component explains nothing since the explained 


variance is zero. 


⁷ $\mathrm{Cov}(\mathbf{v}^\top X, \mathbf{v}_1^\top X) = 0$.




• The total variance explained by $k \le r$ principal components is proportional to $\sigma_1^2 + \cdots + \sigma_k^2$, which is the maximum variance that can be explained by $k$ right-singular vectors due to the best-fit character of SVD. Given a target proportion $0 < \alpha < 1$, we choose the smallest $k$ that satisfies

\[ \frac{\sigma_1^2 + \cdots + \sigma_k^2}{\sigma_1^2 + \cdots + \sigma_k^2 + \cdots + \sigma_r^2} \ge \alpha. \]

We then call $\mathbf{v}_1,\dots,\mathbf{v}_k$ the principal components that explain $100\alpha\%$ of the total variance.


These results can be summarized into the following theorem. 


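Theorem 5.6 Let $\mathrm{Cov}(X, X) = \Sigma$ have eigenvectors $\mathbf{v}_1,\dots,\mathbf{v}_r$ with corresponding eigenvalues $\sigma_1^2 \ge \sigma_2^2 \ge \cdots \ge \sigma_r^2 > 0$. Then:
(i) The $j$-th PC (principal component) is $\mathbf{v}_j^\top X = v_{j1}X_1 + \cdots + v_{jd}X_d$ for $j = 1,\dots,r$.
(ii) The variance of the $j$-th PC is $\mathrm{Var}(\mathbf{v}_j^\top X) = \mathbf{v}_j^\top\Sigma\mathbf{v}_j = \sigma_j^2$.
(iii) Two distinct PCs are uncorrelated, i.e., $\mathrm{Cov}(\mathbf{v}_j^\top X, \mathbf{v}_k^\top X) = \mathbf{v}_j^\top\Sigma\mathbf{v}_k = 0$ for $j \ne k$.

This theorem follows naturally once we notice that the sample covariance is a scaled $A^\top A$ and that $A = \sum_{j=1}^{d} \sigma_j\mathbf{u}_j\mathbf{v}_j^\top$.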
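The statements above can be checked numerically. The following is a minimal sketch on synthetic data; the data, the seed, the target proportion `alpha`, and in particular the $1/n$ normalization of the sample covariance are our assumptions (a $1/(n-1)$ factor would change the variances but not the principal directions or the ratio).

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 500, 5
X = rng.standard_normal((n, d)) @ rng.standard_normal((d, d))   # synthetic correlated samples (rows)

A = X - X.mean(axis=0)                    # center so that the sample mean is zero
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Assumed normalization: sample covariance = A^T A / n.
cov = A.T @ A / n
explained_var = s ** 2 / n                # variance of each principal component v_j^T X

# Smallest k whose cumulative share of variance reaches the target proportion alpha.
alpha = 0.9
ratio = np.cumsum(s ** 2) / np.sum(s ** 2)
k = int(np.searchsorted(ratio, alpha) + 1)

scores = A @ Vt.T                         # column j holds the j-th principal component of each sample
```

Here the rows of `Vt` play the role of $\mathbf{v}_1,\dots,\mathbf{v}_d$, and the sample covariance of `scores` is diagonal, which mirrors items (ii) and (iii) of the theorem.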




Chapter 6 


SVD in Practice 


Singular value decomposition (SVD) is widely used in practice. In this chapter, we consider three practical 
use cases of SVD and its variant. First, SVD is used to compress a large matrix, representing an image, 
into two smaller matrices, without compromising its perceptual quality. This is done by taking only the 
top-K singular triplets, after performing SVD on the full matrix. Second, we show how left singular 
vectors can be used to visualize high-dimensional data, for convenient analysis. As evident from our 
variational formulation of SVD, these left singular vectors are the optimal representations of data in 
terms of reconstruction. We furthermore demonstrate how this approach of using left singular vectors 
for visualization can be extended nonlinearly, with a variational autoencoder. Finally, we show how the 
right singular vectors of financial time-series data automatically capture major underlying factors, by 
analyzing historical yield curves using SVD. These examples are only three out of increasingly many 


applications of SVD in data science and artificial intelligence. 


6.1 Single-Image Compression 


A representative example of using SVD in practice is image compression. An image is often represented as a collection of pixels on a 2-dimensional grid, and each pixel is represented using three numbers corresponding to three color channels (r, g, b). An $n \times d$ image can thus be represented as a collection of three $n \times d$ matrices. Let $A^{(1)}, A^{(2)}, A^{(3)}$ be these three matrices, respectively, and $r_1, r_2, r_3$ be their respective ranks. We start by assuming that the column sums of each of these matrices are all 0.


SVD allows us to represent $A^{(j)}$ as the sum of rank-one matrices:

\[ A^{(j)} = \sum_{i=1}^{r_j} \sigma_i^{(j)}\mathbf{u}_i^{(j)}\mathbf{v}_i^{(j)\top} \]

with positive $\sigma_i^{(j)}$'s and vectors $\mathbf{u}_i^{(j)}$'s and $\mathbf{v}_i^{(j)}$'s. If $k \le \min\{r_1, r_2, r_3\}$, the rank-$k$ approximation of $A^{(j)}$ is

\[ A^{(j)} \approx A^{(j)}_k = \sum_{i=1}^{k} \sigma_i^{(j)}\mathbf{u}_i^{(j)}\mathbf{v}_i^{(j)\top}. \]

Figure 6.1: Mona Lisa in 1024 x 687 pixels.


We show in Figure 6.2 five approximated images of the original Mona Lisa in Figure 6.1 with the ranks $k = 3, 8, 18, 23$ and $34$, respectively. Even when we use only 8.3% of the original data (see Figure 6.2f), it is difficult to discern this approximated (compressed) version from the original.

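When the column means are not zero, we simply subtract the column mean from each column, perform SVD, compute the low-rank approximation, and add back the column mean to each column. In other words, we perform SVD on

\[ A^{(j)} - \mathbf{1}_n\boldsymbol{\mu}_j^\top, \]

where

\[ \boldsymbol{\mu}_j = \frac{1}{n}A^{(j)\top}\mathbf{1}_n. \]

This column average is added back to the low-rank approximation:

\[ A^{(j)} \approx \mathbf{1}_n\boldsymbol{\mu}_j^\top + \tilde{A}^{(j)}_k = \mathbf{1}_n\boldsymbol{\mu}_j^\top + \sum_{i=1}^{k} \tilde\sigma_i^{(j)}\tilde{\mathbf{u}}_i^{(j)}\tilde{\mathbf{v}}_i^{(j)\top}, \]

where the low-rank approximation was done on

\[ A^{(j)} - \mathbf{1}_n\boldsymbol{\mu}_j^\top \approx \tilde{A}^{(j)}_k = \sum_{i=1}^{k} \tilde\sigma_i^{(j)}\tilde{\mathbf{u}}_i^{(j)}\tilde{\mathbf{v}}_i^{(j)\top}. \]

Refer to Section 5.9 to see why we need to subtract and add back the column means.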
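The subtract, truncate, and add-back steps can be sketched for a single channel as follows. The random matrix standing in for a color channel, the function names, and the storage count $k(n+d)$ (which assumes each $\sigma_i$ is folded into $\mathbf{u}_i$ and ignores the stored column means) are assumptions for illustration, not the procedure used to produce the figures.

```python
import numpy as np

def low_rank_with_centering(channel, k):
    """Rank-k approximation of one color channel after removing its column means."""
    mu = channel.mean(axis=0)                         # column means, shape (d,)
    U, s, Vt = np.linalg.svd(channel - mu, full_matrices=False)
    approx_centered = (U[:, :k] * s[:k]) @ Vt[:k]     # sum of the k leading rank-one terms
    return approx_centered + mu                       # add the column means back

def storage_fraction(n, d, k):
    """Rough share of numbers kept: k * (n + d) instead of n * d."""
    return k * (n + d) / (n * d)

rng = np.random.default_rng(3)
n, d = 64, 48
channel = rng.random((n, d))                          # stand-in for one n x d color channel
approx = low_rank_with_centering(channel, k=10)
```

Under this rough count, `storage_fraction(1024, 687, 34)` is about 0.083, which is consistent with the 8.3% reported for rank 34 in Figure 6.2.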




Figure 6.2: Low-rank approximations of the Mona Lisa, including (d) Rank 18 (4.4%), (e) Rank 23 (5.6%), and (f) Rank 34 (8.3%). (The percentage is the portion of data in use.)


6.1.1 Singular values reveal the amount of information in low rank approximation


We can quantify how well the original image is represented by the compressed image by

\[ \frac{\sum_{i=1}^{k} \big(\sigma_i^{(j)}\big)^2}{\sum_{i=1}^{r_j} \big(\sigma_i^{(j)}\big)^2} \]

using the $k$ largest singular values. We can plot this proportion for each color channel over $k$, to visually inspect and determine the right balance between compression and fidelity.

From Figure 6.3, we observe that red and green colors are more easily captured with fewer singular triplets, compared to blue. Even when the ratio of blue was less than 0.8 with $k = 34$, we were not able to visually discern a compressed image from the original image. This may be explained by the fact that the proportion of cone cells, which are photoreceptor cells in the retina, that respond to blue (known to be only about 2%) is much lower than those of cone cells responding to red and green.

Figure 6.3: Ratio explained by singular triplets




6.2 Visualizing High-Dimensional Data via SVD 


6.2.1 Left-singular Vectors as the Coordinates of Embedding Vectors in the Latent Space


We are interested in finding a $k$-dimensional subspace that approximates well $n$ data points in a $d$-dimensional vector space, $\{\mathbf{a}_1,\dots,\mathbf{a}_n\} \subset \mathbb{R}^d$. We can formulate a problem of identifying a $k$-dimensional subspace as a problem of finding $k$ $d$-dimensional basis vectors that form an orthonormal basis of the subspace. In other words, we can phrase this problem as estimating an unknown $d \times k$ matrix $V$ that consists of these basis vectors as its columns to maximize $\sum_{i=1}^{n} |V^\top\mathbf{a}_i|^2$. That is,

\[ V^* = \operatorname*{argmax}_{V:\, d\times k,\ V^\top V = I_k}\ \sum_{i=1}^{n} |V^\top\mathbf{a}_i|^2. \qquad (6.1) \]


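Using the orthonormality $V^\top V = I_k$, we see that

\[
\begin{aligned}
|\mathbf{a}_i - VV^\top\mathbf{a}_i|^2
&= (\mathbf{a}_i - VV^\top\mathbf{a}_i)^\top(\mathbf{a}_i - VV^\top\mathbf{a}_i) \\
&= (\mathbf{a}_i^\top - \mathbf{a}_i^\top VV^\top)(\mathbf{a}_i - VV^\top\mathbf{a}_i) \\
&= |\mathbf{a}_i|^2 - 2\mathbf{a}_i^\top VV^\top\mathbf{a}_i + \mathbf{a}_i^\top VV^\top VV^\top\mathbf{a}_i \\
&= |\mathbf{a}_i|^2 - |V^\top\mathbf{a}_i|^2.
\end{aligned}
\]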


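Because $\sum_{i=1}^{n} |\mathbf{a}_i|^2$ is a constant with respect to $V$, finding $V$ that maximizes $\sum_{i=1}^{n} |V^\top\mathbf{a}_i|^2$ is equivalent to finding $V$ that minimizes $\sum_{i=1}^{n} |\mathbf{a}_i - VV^\top\mathbf{a}_i|^2$. In other words, the original problem (6.1) can be rephrased as

\[ V^* = \operatorname*{argmin}_{V:\, d\times k,\ V^\top V = I_k}\ \sum_{i=1}^{n} |\mathbf{a}_i - VV^\top\mathbf{a}_i|^2. \qquad (6.2) \]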

Under this formulation, you can view $VV^\top\mathbf{a}_i$ as reconstructing the original vector $\mathbf{a}_i$ back from the $k$-dimensional compression $V^\top\mathbf{a}_i$ by multiplying it with $V$. Then, this problem can be thought of as minimizing the reconstruction error.¹,² We often refer to this $k$-dimensional subspace, into which we transformed and embedded the data points, as a latent space.
Let $A$ be an $n \times d$ data matrix with $\mathbf{a}_i$'s as its rows and $V$ be this unknown $d \times k$ matrix with $\mathbf{v}_j$'s as its columns. Then, the subspace spanned by the column vectors $\{\mathbf{v}_1,\dots,\mathbf{v}_k\}$, obtained as a solution to (6.1), has the same effect for dimension reduction as the subspace spanned by the right singular vectors of $A$ that correspond to the $k$ largest singular values, in terms of the squared residual distance, because

\[ \sum_{i=1}^{n} |V^\top\mathbf{a}_i|^2 = \|AV\|_F^2 = \sum_{j=1}^{k} |A\mathbf{v}_j|^2. \]


¹ This is a special case of an autoencoder in machine learning, where $V^\top$ is an encoder and $V$ is a decoder. That is, this corresponds to a linear autoencoder with a tied weight.
² It is natural to consider $VV^\top$ as the matrix form of projection onto a subspace spanned by orthonormal vectors, as in (4.7). It is however non-trivial to go from projection to reconstruction via $V^\top$.




The column vectors of $V$, however, may not be ordered according to the singular values. That is, $|A\mathbf{v}_i| \ge |A\mathbf{v}_{i+1}|$ may not hold for some $i$.

Let $\mathbf{b}_i$ be the image of $\mathbf{a}_i$ in the latent space. It is usual for us to call $\mathbf{b}_i$ the embedding vector of $\mathbf{a}_i$. The $i$-th element of the $j$-th left singular vector of $A$, $(\mathbf{u}_j)_i$, is a scalar multiple of the $j$-th element of the embedding vector $\mathbf{b}_i$, because $\sigma_j(\mathbf{u}_j)_i = (A\mathbf{v}_j)_i = \mathbf{a}_i^\top\mathbf{v}_j = (V^\top\mathbf{a}_i)_j = (\mathbf{b}_i)_j$. In other words, the $j$-th left singular vector multiplied by its singular value, $\sigma_j\mathbf{u}_j$, which is $n$-dimensional, is a collection of the $j$-th coordinate values of the $n$ data points in the latent space. That is, the $i$-th row of $U\Sigma = [\sigma_1\mathbf{u}_1 \,|\, \sigma_2\mathbf{u}_2 \,|\, \dots \,|\, \sigma_k\mathbf{u}_k]$ is $\mathbf{b}_i^\top$. With an appropriate choice of $k$, we can gain insights into the data point $\mathbf{a}_i$ by analyzing its embedding vector $\mathbf{b}_i$.
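The relation between embedding vectors and scaled left singular vectors can be verified on a small example. This is a sketch on random data; the matrix sizes and variable names are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
n, d, k = 100, 6, 2
A = rng.standard_normal((n, d))                 # rows are data points a_i

U, s, Vt = np.linalg.svd(A, full_matrices=False)
V = Vt[:k].T                                    # d x k, columns are the k leading right singular vectors

B = A @ V                                       # row i is the embedding b_i = V^T a_i
B_from_left = U[:, :k] * s[:k]                  # row i of the first k columns of U Sigma

reconstruction = B @ V.T                        # row i is V V^T a_i
residual = np.sum((A - reconstruction) ** 2)    # total squared reconstruction error
identity_rhs = np.sum(A ** 2) - np.sum(B ** 2)  # sum |a_i|^2 - sum |V^T a_i|^2
```

The two ways of computing the embeddings agree, and the reconstruction error equals the sum of the squared singular values that were left out.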


6.2.2 Geometry of MNIST Images According to SVD 


The MNIST dataset consists of handwritten digits and has been used extensively to train and evaluate various image processing systems as well as machine learning algorithms. It contains 60,000 training images and 10,000 test images of handwritten digits (0-9). See Figure 6.4 for 160 randomly selected images from MNIST.


Figure 6.4: MNIST Data Set


Each handwritten digit in MNIST is represented as a 28 x 28 grayscale image. Each pixel can take one of 256 intensity levels (0-255). In Figure 6.5, we demonstrate how the 8-th training image from MNIST, which corresponds to a handwritten three, can be plotted in two different ways: one as an actual grayscale image and the other as a 28 x 28 matrix.

When analyzing such a dataset, it is often more convenient to analyze it as a collection of vectors 
rather than as images, although the latter tend to be more familiar to us. In the case of MNIST, we can 
reshape the matrix of each handwritten digit into a 784-dimensional vector. Once numerical analysis is 
completed, we can visualize these data points as well as intermediate quantities as images rather than 
vectors. 


(a) An image of 3 (b) 28x28 matrix representation of the left image
Figure 6.5: An Example of MNIST Image and its Matrix Representation

Once we create an $n \times d$ data matrix $A$ of MNIST, with $n = 70{,}000$ and $d = 784$, by vertically stacking 784-dimensional vectors, we first compute the mean of the rows (column means) by $\frac{1}{n}\mathbf{1}_n^\top A$, which we visualize in Figure 6.6. We can check other properties of this data matrix, such as its rank, by using the numpy package of Python, which results in $\operatorname{rank} A = 713$.


Figure 6.6: The Mean Image of MNIST 


We now subtract this mean from each row of the data matrix $A$ by forming

\[ A - \frac{1}{n}\mathbf{1}_n\mathbf{1}_n^\top A, \]

after which we perform SVD on the centered matrix, which we continue to denote by $A$. We first visualize the top-64 right singular vectors, according to their associated singular values, in Figure 6.7. Although it is not easy to interpret these right singular vectors intuitively, we notice that the spatial frequency increases as the singular values decrease. This is evident from the increasingly frequent flips between black and white contiguous regions.
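The same steps can be sketched on a small stand-in matrix, so that the example runs without downloading MNIST. The random data, the sizes, and the three all-zero columns (meant to mimic pixels that are blank in every image, one plausible way for the rank to fall below the number of columns) are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(5)
n, d = 300, 20
A = rng.random((n, d))                       # small stand-in for the 70,000 x 784 MNIST matrix
A[:, :3] = 0.0                               # assumed always-blank pixels, which lower the rank

rank = np.linalg.matrix_rank(A)              # numerical rank of the data matrix
mean_row = np.ones(n) @ A / n                # (1/n) 1^T A, the mean of the rows
A_centered = A - np.outer(np.ones(n), mean_row)   # A - (1/n) 1 1^T A

U, s, Vt = np.linalg.svd(A_centered, full_matrices=False)
```

Each row of `Vt` could then be reshaped to the image size and displayed, in the manner of Figure 6.7.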

These right singular vectors form a basis of a subspace. In Figure 6.8, we plot all rows of the data matrix on the subspace spanned by the two leading right singular vectors, placing the $i$-th row at the coordinates given by the $i$-th entries of the first two left singular vectors. We can visually confirm that digits of a similar shape are placed nearby and sometimes even overlap with each other. As could have been guessed from the first right singular vector, which depicted a 0-like shape, zeros are clustered in a region with large $x$.


Next, we visually inspect the quality of low-rank approximation, while varying the number of the right singular vectors: 2, 22, 42, 62 and 82. For each handwritten digit, we plot its reconstructed versions from the corresponding low-dimensional subspaces.

Figure 6.7: 64 Leading Right Singular Vectors of A

Figure 6.8: Two Leading Left Singular Vectors of A (horizontal axis: U[:, 0]; vertical axis: U[:, 1]; points marked by digit class)


The first column in Figure 6.9 is the original image, and the subsequent columns correspond to the reconstructed images based on small numbers of right singular vectors. Already with 42 right singular vectors alone, out of 700 or so, reconstructed images are almost as good as the original ones. We can quantify the quality of approximation with $\sum_{i=1}^{k} \sigma_i^2 \big/ \sum_{i} \sigma_i^2$, where the denominator sums over all singular values, as in the following figure:
i=l i 




Figure 6.9: More Singular Vectors, More Accurate Approximation 


[Plot: Percentage Explained (vertical axis, up to 100) against Singular Values (horizontal axis, 0 to 800).]


6.2.3 Geometry of MNIST Images in the Latent Space of a Variational Autoencoder


Handwritten digits in MNIST still have significant overlaps across digit classes when we visualize them using SVD, as shown in Figure 6.8. This overlap clutters our analysis and is difficult to overcome due to the linearity of SVD. It is thus a common practice to project data nonlinearly onto a lower-dimensional subspace for analyzing data in more depth. A variational autoencoder (VAE) is one representative example of such an approach. A VAE consists of two neural networks, described earlier in Section 3.10.1, called the encoder and decoder. The output layer of the encoder has $d$ nodes (here $d$ denotes the latent dimension, not the data dimension), and the output from the encoder is fed to the decoder as the input. The decoder outputs a 784-dimensional vector, corresponding to the dimensionality of the original MNIST image. The VAE is trained to minimize the reconstruction error while being regularized to prevent overfitting. The space in which the output from the encoder resides is often called a latent space, and the output from the encoder a latent representation. With $d = 2$, we can visualize the MNIST images readily, as in Figure 6.10.




Figure 6.10: Latent space of VAE 


6.3 Approximation of Financial Time-Series via SVD 


In finance, the yield to maturity of a fixed-interest security (typically, a bond) is an estimate of the rate of total return to be earned by its owner who buys it at a market price, holds it to maturity, and receives all interest payments and the capital redemption on schedule. The yield curve is a graph which depicts how the yields to maturity vary as a function of the years remaining to maturity. The graph's horizontal axis is a time line of months or years remaining to maturity. The vertical axis depicts the annualized yield to maturity. As a demonstration, Figure 6.11 shows 416 yield curves observed weekly from 2010 to 2017. These curves linearly connect yields corresponding to 11 remaining maturities: 3 months, 6 months, 9 months, one year, one and a half years, two years, two and a half years, 3 years, 5 years, 10 years, and 20 years.


Figure 6.11: Yield curves observed weekly from 2010 to 2017. (Horizontal axis: Maturity (months), 0 to 200; vertical axis: Yields.)


To see whether there are any underlying patterns in the yield curves, we apply SVD to a matrix whose rows represent the curves. That is, let us regard the 11 yields observed at each week as a row of a matrix and call the matrix $A$. Note that $A$ is a $416 \times 11$ matrix. By subtracting the column mean of $A$, $\boldsymbol{\mu} = \frac{1}{416}A^\top\mathbf{1}_{416} \in \mathbb{R}^{11}$, from each row, we get

\[ \hat{A} = A - \mathbf{1}_{416}\boldsymbol{\mu}^\top \]

whose column sums vanish. Denote the SVD of $\hat{A}$ as

\[ \hat{A} = U\Sigma V^\top = \sum_{k=1}^{11} \sigma_k\mathbf{u}_k\mathbf{v}_k^\top \]

where $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_{11}$, and $|\mathbf{v}_k| = 1$, $k = 1,\dots,11$. The $i$-th row of $\hat{A}$, corresponding to the curve at the $i$-th week, is

\[ \sum_{k=1}^{11} u_{ik}\sigma_k\mathbf{v}_k \]


where $u_{ik}$ is the $i$-th entry of $\mathbf{u}_k$. For $\hat{A}$, the first right-singular vector explains 93.74% of the total variation and the top three right-singular vectors explain 99.9% of the total variation. The ratios $\sum_{i=1}^{k} \sigma_i^2 \big/ \sum_{i=1}^{11} \sigma_i^2$ for $k = 1,\dots,11$ are, in percent,

93.74, 99.61, 99.90, 99.94, 99.97, 99.98, 99.98, 99.99, 99.99, 99.99, 100.00,

respectively. Since the ratio for the three leading singular values is close to 1, the truncated sum of the three leading terms approximates the $i$-th row of $\hat{A}$ very well. That is,

\[ \sum_{k=1}^{11} u_{ik}\sigma_k\mathbf{v}_k \approx u_{i1}\sigma_1\mathbf{v}_1 + u_{i2}\sigma_2\mathbf{v}_2 + u_{i3}\sigma_3\mathbf{v}_3. \qquad (6.3) \]
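The centering, the cumulative ratios, and the three-term truncation in (6.3) can be sketched as follows. The yield matrix here is synthetic (three smooth curves plus small noise), so the printed ratios will not match the percentages above; the factor shapes, noise level, and seed are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(6)
weeks, n_maturities = 416, 11
tau = np.array([3, 6, 9, 12, 18, 24, 30, 36, 60, 120, 240], dtype=float)   # maturities in months

# Synthetic stand-in for the yield matrix: three smooth factors plus small noise.
factors = np.vstack([np.ones_like(tau),
                     np.exp(-tau / 40.0),
                     (tau / 40.0) * np.exp(-tau / 40.0)])
A = rng.standard_normal((weeks, 3)) @ factors + 0.01 * rng.standard_normal((weeks, n_maturities))

mu = A.mean(axis=0)                                   # column means
A_hat = A - mu                                        # subtract the mean from each row
U, s, Vt = np.linalg.svd(A_hat, full_matrices=False)

ratios = 100 * np.cumsum(s ** 2) / np.sum(s ** 2)     # cumulative percentage for k = 1, ..., 11
A_hat_3 = (U[:, :3] * s[:3]) @ Vt[:3]                 # truncated sum of the three leading terms
rel_error = np.linalg.norm(A_hat - A_hat_3) / np.linalg.norm(A_hat)
```

Row `i` of `A_hat_3` is the right-hand side of (6.3) for week `i`, and adding `mu` back gives the approximated yield curve.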
k=1 


Let us look at these right-singular vectors. 


Figure 6.12: Right-singular vectors of leading three singular values. (a) Right singular vectors, $\mathbf{v}_k$ (vertical axis: Empirical Factor Curves); (b) Scaled right-singular vectors, $\sigma_k\mathbf{v}_k$ (vertical axis: Scaled Empirical Factor Curves). Horizontal axes: Maturity (months).


The three right-singular vectors having the three largest singular values are depicted in the left panel, Figure 6.12a. The right panel shows the right-singular vectors multiplied by their singular values. The red curve is the first right-singular vector in Figure 6.12. It determines the overall levels of yields and is called a level factor in econometrics. The green one affects the initial slopes of curves and is called a slope factor. The blue one represents the curvature of yield curves near a maturity of 4 years and is called a curvature factor. These empirical findings through SVD accompany an analytical modeling of the three factor curves as

\[ 1, \qquad \frac{1 - e^{-\lambda\tau}}{\lambda\tau}, \qquad \frac{1 - e^{-\lambda\tau}}{\lambda\tau} - e^{-\lambda\tau}, \]

whose graphs are depicted in Figure 6.13.


Figure 6.13: Three factor curves of DNS for $\lambda = 0.05$ (vertical axis: Factor Curves; horizontal axis: Maturity (months))


The estimation of the parameter $\lambda$ is crucial in empirical finance. A natural next step would be reconstructing the yield curves by these three analytic curves. Denote the yield at $t$ of a bond with remaining maturity $\tau$ as $y_t(\tau)$. With well-chosen $L_t$, $S_t$, $C_t$ and $\lambda$,

\[ Y_t(\tau) = L_t \times 1 + S_t \times \frac{1 - e^{-\lambda\tau}}{\lambda\tau} + C_t \times \left(\frac{1 - e^{-\lambda\tau}}{\lambda\tau} - e^{-\lambda\tau}\right) \qquad (6.4) \]

may approximate $y_t(\tau)$. This idea is called the (Dynamic) Nelson-Siegel method in econometrics. If $\lambda$ varies through $t$, the method is called the Nelson-Siegel method. If $\lambda$ is fixed for all $t$, it is called the Dynamic Nelson-Siegel (DNS) method. If we compare the two representations (6.3) and (6.4), we may regard the three right-singular vectors in (6.3) as proxies of the three analytic factor functions in (6.4). Then, the leading three left-singular vectors correspond to the sequence of $(L_t, S_t, C_t)$ in (6.4). $(\mathbf{u}_1, \mathbf{u}_2, \mathbf{u}_3)$ is depicted in Figure 6.14 as a sequence of $\mathbb{R}^3$-vectors. $\mathbf{u}_1$, $\mathbf{u}_2$, and $\mathbf{u}_3$ correspond to the red, green, and blue curves, respectively. As a finding from the curves, the red curve representing the overall level of yields shows that the yields from bonds decrease through the period.³ Further empirical findings can be drawn from a financial viewpoint, but we do not pursue them here.

Let us compare the original yield curves and the approximated ones in Figure 6.15.

They seem similar, but the approximated curves in the right panel lose some curved features of the original yields, which would require more right-singular vectors to capture. However, insights from the simple expression might be more important for financial decisions than keeping local details of yield curves.
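For a fixed $\lambda$, fitting (6.4) to one observed curve is a linear least-squares problem in $(L_t, S_t, C_t)$. The sketch below illustrates this under assumptions of our own: the helper `ns_loadings`, the maturities in months, $\lambda = 0.05$ as in Figure 6.13, and a hypothetical yield curve generated from known coefficients.

```python
import numpy as np

def ns_loadings(tau, lam):
    """Columns are the three factor curves: level, slope, curvature."""
    x = lam * tau
    slope = (1.0 - np.exp(-x)) / x
    curvature = slope - np.exp(-x)
    return np.column_stack([np.ones_like(tau), slope, curvature])

tau = np.array([3, 6, 9, 12, 18, 24, 30, 36, 60, 120, 240], dtype=float)   # maturities in months
lam = 0.05                                   # fixed decay parameter, as in the DNS setting
F = ns_loadings(tau, lam)                    # 11 x 3 matrix of factor loadings

# Hypothetical yield curve for one week, generated from known (L, S, C) for illustration.
true_coef = np.array([3.0, -2.0, 1.0])
y = F @ true_coef

# Least-squares estimate of (L_t, S_t, C_t) for this week.
coef, *_ = np.linalg.lstsq(F, y, rcond=None)
```

Repeating the fit for every week would give a sequence of $(L_t, S_t, C_t)$ that can be compared with the leading left-singular vectors, in the spirit of the comparison between (6.3) and (6.4).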


³ Most of the yields are explained by $\sigma_1\mathbf{v}_1$, whose entries are all positive. The yields are approximated as
$\boldsymbol{\mu} + u_{t1}\sigma_1\mathbf{v}_1 + u_{t2}\sigma_2\mathbf{v}_2 + u_{t3}\sigma_3\mathbf{v}_3$
and usually stay in the positive regime even if the coefficient $u_{t1}$ is negative, because of the first mean vector term $\boldsymbol{\mu}$. $u_{t1}$ has negative values in the second half of the period, as we can see in Figure 6.14.




Figure 6.14: $(\mathbf{u}_1, \mathbf{u}_2, \mathbf{u}_3)$ plot (horizontal axis: Issued Week)


Figure 6.15: Illustration of dimension reduction in yield curve representation. (a) Original yield curves; (b) Yield curves represented in 3 factors. Horizontal axes: Maturity (months).




Chapter 7 


Positive Definite Matrices 


Positive definite matrices are latent everywhere around us. A positive definite matrix called a covariance matrix captures the interrelation among scattered observations of multiple random quantities. Many engineering and data science tasks are based on parameter optimization, and positive definiteness is crucial in efficient and stable solution procedures for such optimization. Positive definite matrices also have an elegant connection to convexity through high-dimensional ellipsoids, which inspires many geometrical ideas in data science. The positive definiteness of a square matrix is defined by the positive values of its quadratic form for any non-zero vector. In terms of arithmetic, real square matrices are similar to real numbers: both have addition and multiplication. A positive real number $a$ likewise gives a positive quadratic form $ax^2$ for any non-zero real number $x$. A positive definite matrix has a unique positive definite square root, as does a positive real number.




7.1 Positive (Semi-)Definite Matrices 


Positive (semi-)definiteness is widely used to describe the positiveness of quadratic forms induced by special matrices, such as the covariance matrix¹ of a random vector and the Hessian matrix of a convex² multivariate function. Furthermore, there is a beautiful one-to-one correspondence between a positive definite matrix and an inner product of a vector space (see Theorem 4.1). We already introduced positive definiteness in Definition 4.2. Similarly, we define a positive semi-definite matrix by relaxing the positiveness to non-negativity.


The following conditions are equivalent to positive semi-definiteness. 


Fact 7.1 For a symmetric matrix $A$, the following are equivalent:
1. For all $\mathbf{x}$, $\mathbf{x}^\top A\mathbf{x} \ge 0$;
2. All eigenvalues of $A$ are nonnegative;
3. $A = B^\top B$ for some matrix $B$.


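Proof: According to the real spectral theorem, a symmetric matrix $A$ can be expressed as

\[ A = \sum_{i=1}^{n} \lambda_i\mathbf{v}_i\mathbf{v}_i^\top = V\Lambda V^\top, \]

where $\Lambda = \mathrm{diag}(\lambda_1,\dots,\lambda_n)$ is a diagonal matrix with eigenvalues of $A$ as its diagonal entries, and $V = [\mathbf{v}_1 \,|\, \mathbf{v}_2 \,|\, \cdots \,|\, \mathbf{v}_n]$ is an orthogonal matrix whose columns consist of orthonormal eigenvectors of $A$.

• (1) ⇒ (2): for all $j$, $\lambda_j = \mathbf{v}_j^\top A\mathbf{v}_j \ge 0$.

• (2) ⇒ (3): Because all diagonal entries of $\Lambda$ are greater than or equal to 0, $D = \mathrm{diag}(\sqrt{\lambda_1},\dots,\sqrt{\lambda_n})$ is well-defined such that $\Lambda = D^2$ and $D^\top = D$. Thus, $A = V\Lambda V^\top = VDD^\top V^\top = B^\top B$ with $B = DV^\top$.

• (3) ⇒ (1): $\mathbf{x}^\top A\mathbf{x} = \mathbf{x}^\top B^\top B\mathbf{x} = |B\mathbf{x}|^2 \ge 0$.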


We can derive a similar set of equivalences for a positive definite matrix as well.

Fact 7.2 For a $d \times d$ symmetric matrix $A$, the following are equivalent:
1. For all $\mathbf{x} \ne \mathbf{0}$, $\mathbf{x}^\top A\mathbf{x} > 0$;
2. All eigenvalues are positive;
3. $A = B^\top B$ for some invertible matrix $B$.

¹ Refer to Appendix D.
² Refer to Appendix A.


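Proof: As in the proof of Fact 7.1, we have $A = \sum_{i=1}^{d} \lambda_i\mathbf{v}_i\mathbf{v}_i^\top = V\Lambda V^\top$ in hand due to the real spectral theorem.

• (1) ⇒ (2): for all $j$, $\lambda_j = \mathbf{v}_j^\top A\mathbf{v}_j > 0$;

• (2) ⇒ (3): Since the diagonal entries of $\Lambda$ are positive eigenvalues, $D = \mathrm{diag}(\sqrt{\lambda_1},\dots,\sqrt{\lambda_d})$ is well-defined such that $\Lambda = D^2$ and $D^\top = D$. Therefore, $A = V\Lambda V^\top = VDD^\top V^\top = B^\top B$, where $B^\top = VD$ consists of linearly independent columns since they are positively scaled columns of the orthogonal matrix $V$;

• (3) ⇒ (1): $\mathbf{x}^\top A\mathbf{x} = \mathbf{x}^\top B^\top B\mathbf{x} = |B\mathbf{x}|^2 > 0$ if $\mathbf{x} \ne \mathbf{0}$ since $B$ is invertible.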
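The three equivalent conditions can be checked numerically on a small example. This is a sketch only: the matrix `A` is constructed to be positive definite, and the construction, seed, and test vector are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(7)
M = rng.standard_normal((4, 4))
A = M.T @ M + 0.1 * np.eye(4)             # symmetric positive definite by construction

eigvals, V = np.linalg.eigh(A)            # real spectral decomposition A = V diag(eigvals) V^T
D = np.diag(np.sqrt(eigvals))
B = D @ V.T                               # factor with A = B^T B, as in step (2) => (3)

x = rng.standard_normal(4)
quad = x @ A @ x                          # quadratic form x^T A x
```

The quadratic form equals $|B\mathbf{x}|^2$, which is how step (3) ⇒ (1) of the proof argues positivity.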


We can further derive the following result from Fact 7.2.

Fact 7.3 If $A$ is symmetric and positive definite, then $A$ is invertible, and $A^{-1}$ is also positive definite.

Proof: According to Fact 7.2, a positive definite matrix is of the product form $B^\top B$ for some invertible matrix $B$. Then, the product itself is also invertible. Furthermore, its inverse $B^{-1}(B^{-1})^\top = C^\top C$, with the invertible matrix $C = (B^{-1})^\top$, is again of the form of a positive definite matrix by Fact 7.2. □


We can then list conditions that must be satisfied by a positive definite matrix.

Example 7.1 For a positive (semi-)definite matrix $A = (a_{ij})$,

1. all diagonal entries are positive (non-negative), i.e., $a_{ii} > (\ge)\, 0$ for all $i$, since $\mathbf{x} = \mathbf{e}_i$ leads to $\mathbf{x}^\top A\mathbf{x} = a_{ii}$, which has to be greater than (or equal to) 0.
2. the leading $k \times k$ block $(a_{ij})_{1\le i\le k,\ 1\le j\le k}$ of $A$ is also positive (semi-)definite for all $k$. We can see it by letting $\mathbf{x} = (x_1,\dots,x_k,0,\dots,0)^\top$.

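Since $\mathbf{x}^\top A\mathbf{x}$ is a scalar, for any square matrix $A$, including an asymmetric one,

\[ \mathbf{x}^\top A\mathbf{x} = \frac{\mathbf{x}^\top A\mathbf{x} + \mathbf{x}^\top A^\top\mathbf{x}}{2} = \mathbf{x}^\top\Big(\tfrac{1}{2}(A + A^\top)\Big)\mathbf{x} \]

holds for any vector $\mathbf{x}$. So, the positive (semi-)definiteness of $A$ is equivalent to the positive (semi-)definiteness of the associated symmetric matrix $\tfrac{1}{2}(A + A^\top)$. For this reason, we study positive definiteness through symmetric matrices only.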




7.2 Cholesky Factorization of a Positive Definite Matrix 


A positive definite matrix $A$ can be expressed as $A = B^\top B$ with an invertible matrix $B$, according to Fact 7.2. We can further decompose $B$ using QR decomposition as the product of an orthogonal matrix $Q$ and an upper triangular matrix $R$ with positive diagonals, i.e., $B = QR$. Since $Q^\top Q = I$,

\[ A = B^\top B = (QR)^\top QR = R^\top Q^\top QR = R^\top R. \]

We call this Cholesky decomposition.

Fact 7.4 (Cholesky decomposition) For a positive definite matrix $A$, there exists a unique upper triangular matrix $R$ with positive diagonal entries such that $A = R^\top R$.


Proof: Since we showed its existence already, we show the uniqueness here. Assume there exist two upper-triangular matrices with positive diagonal entries, $R = (r_{ij})$ and $S = (s_{ij})$, that satisfy $A = R^\top R = S^\top S$. Because $S$ and $R$ are both invertible,

\[ (S^{-1})^\top R^\top = SR^{-1}. \]

The inverse of an upper-triangular matrix is upper-triangular, and the product of upper (lower)-triangular matrices is an upper (lower)-triangular matrix. Thus, the left-hand side of this equation is a lower-triangular matrix, and the right-hand side an upper-triangular matrix. In other words, $SR^{-1}$ is a diagonal matrix. For all $i$,

\[ \frac{r_{ii}}{s_{ii}} = \frac{s_{ii}}{r_{ii}} \]

because $(S^{-1})^\top R^\top = \mathrm{diag}(r_{ii}/s_{ii})$ and $SR^{-1} = \mathrm{diag}(s_{ii}/r_{ii})$. Since the diagonal elements of both $R$ and $S$ are positive, $s_{ii} = r_{ii}$, which results in $SR^{-1} = I$. Therefore, $S = R$. □


An Algorithm for Cholesky Decomposition 


Here, we describe how to find $R = (r_{ij})$ such that $A = R^\top R$. First, $r_{11} = \sqrt{a_{11}}$, from $a_{11} = r_{11}^2 > 0$. We notice that the first row of $R^\top R$ consists of the products $r_{11} r_{1j}$. Combining these two, we get $a_{1j} = r_{11} r_{1j} = \sqrt{a_{11}}\, r_{1j}$, from which we can determine $r_{1j}$. Once we know $r_{12}$, we can also determine $r_{22} = \sqrt{a_{22} - r_{12}^2}$, because $0 < a_{22} = r_{12}^2 + r_{22}^2$. The rest of the second row can be determined from $a_{2j} = r_{12} r_{1j} + r_{22} r_{2j}$, because we know $r_{12}$, $r_{1j}$ and $r_{22}$ already. We can repeat this procedure to determine the elements of the remaining rows, and based on the uniqueness of the Cholesky decomposition, we know that the resulting matrix is the desired one.
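The row-by-row procedure above can be sketched in a few lines of Python. This is a minimal illustration rather than a production routine: the function name `cholesky_upper`, the list-of-rows matrix representation, and the example matrix are all assumptions introduced here.

```python
import math

def cholesky_upper(A):
    """Return the upper-triangular R with positive diagonal such that A = R^T R.

    A is assumed to be a symmetric positive definite matrix given as a list of rows.
    """
    n = len(A)
    R = [[0.0] * n for _ in range(n)]
    for i in range(n):
        # Diagonal entry, from a_ii = r_1i^2 + ... + r_ii^2.
        s = A[i][i] - sum(R[k][i] ** 2 for k in range(i))
        if s <= 0.0:
            raise ValueError("matrix is not positive definite")
        R[i][i] = math.sqrt(s)
        # Rest of row i, from a_ij = r_1i r_1j + ... + r_ii r_ij for j > i.
        for j in range(i + 1, n):
            t = A[i][j] - sum(R[k][i] * R[k][j] for k in range(i))
            R[i][j] = t / R[i][i]
    return R

# Illustrative 3 x 3 positive definite matrix.
A = [[4.0, 2.0, 2.0],
     [2.0, 5.0, 3.0],
     [2.0, 3.0, 6.0]]
R = cholesky_upper(A)
```

For this example the rows of `R` come out as (2, 1, 1), (0, 2, 1), (0, 0, 2), and multiplying $R^\top R$ recovers `A`.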


7.3 The Square Root of a Positive Semi-definite Matrix 


For any positive $a$, there exists a unique positive real $b$ that satisfies $a = b^2$. An analogous statement holds for a positive definite matrix: for any positive definite matrix $A$, there exists a positive definite matrix $B$ that satisfies $A = B^2$. Although the two statements look similar, this property differs from Fact 7.2, which does not require $B$ to be symmetric.

Let $A$ be a symmetric positive (semi-)definite matrix. We can use real spectral decomposition to express it as

$$A = V \Lambda V^\top,$$

where $V$ is an orthogonal matrix and $\Lambda = \operatorname{diag}(\lambda_i)$ is a non-negative diagonal matrix. If we let $D = \operatorname{diag}(\sqrt{\lambda_i})$, $D$ is also positive definite or positive semi-definite, identically to $\Lambda$, and $\Lambda = D^2$. If we let $B = V D V^\top$,

$$A = V D^2 V^\top = V D V^\top V D V^\top = BB = B^2,$$

because $V^\top V = I$. $A$ and $B$ share the same positive definiteness.
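As a small worked instance of this construction (the matrix below is an illustrative choice, not one taken from the text), the following computes $B = V D V^\top$ for a $2 \times 2$ matrix and checks $B^2 = A$:

```latex
A = \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix}, \qquad
V = \frac{1}{\sqrt{2}} \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}, \qquad
\Lambda = \begin{pmatrix} 3 & 0 \\ 0 & 1 \end{pmatrix}, \qquad
D = \begin{pmatrix} \sqrt{3} & 0 \\ 0 & 1 \end{pmatrix},
\\[4pt]
B = V D V^\top
  = \frac{1}{2} \begin{pmatrix} \sqrt{3} + 1 & \sqrt{3} - 1 \\ \sqrt{3} - 1 & \sqrt{3} + 1 \end{pmatrix},
\\[4pt]
B^2 = \frac{1}{4} \begin{pmatrix}
  (\sqrt{3}+1)^2 + (\sqrt{3}-1)^2 & 2(\sqrt{3}+1)(\sqrt{3}-1) \\
  2(\sqrt{3}+1)(\sqrt{3}-1) & (\sqrt{3}+1)^2 + (\sqrt{3}-1)^2
\end{pmatrix}
= \frac{1}{4} \begin{pmatrix} 8 & 4 \\ 4 & 8 \end{pmatrix} = A.
```

Both eigenvalues of $B$, namely $\sqrt{3}$ and $1$, are positive, so $B$ is positive definite like $A$.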


Fact 7.5 For a positive (semi-)definite matrix $A$, there exists a unique positive (semi-)definite matrix $B$ such that $A = B^2 = B^\top B$. We denote this $B$ as $A^{1/2}$.


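Proof: Since we already showed existence above, we show uniqueness here. Using the projection form of real spectral decomposition (5.14), we write $A$ as $A = \sum_{i=1}^{r} \lambda_i P_{A,\lambda_i}$, and a symmetric $B$ as $B = \sum_{j=1}^{s} \mu_j P_{B,\mu_j}$. Because each $P_{B,\mu_j}$ is an orthogonal projection, $P_{B,\mu_j}^\top = P_{B,\mu_j}$, $P_{B,\mu_j}^2 = P_{B,\mu_j}$, and $P_{B,\mu_j} P_{B,\mu_k} = 0$ for $j \neq k$. Thus, we get

$$B^2 = \Big(\sum_{j=1}^{s} \mu_j P_{B,\mu_j}\Big)^2 = \sum_{j=1}^{s} \mu_j^2 P_{B,\mu_j}.$$

For $A = B^2$ to hold, $r = s$ must hold, and for each $j$ there is some $i$ with $\mu_j = \sqrt{\lambda_i}$ and $P_{B,\mu_j} = P_{A,\lambda_i}$. Therefore, $B$ must be expressed as

$$B = \sum_{i=1}^{r} \sqrt{\lambda_i}\, P_{A,\lambda_i},$$

which is unique.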


7.4 Variational Characterization of Symmetric Eigenvalues 


An $n \times n$ symmetric matrix $A$ has $n$ real eigenvalues. We sort these eigenvalues as follows:

$$\lambda_{\max}(A) = \lambda_1(A) \ge \lambda_2(A) \ge \cdots \ge \lambda_n(A) = \lambda_{\min}(A).$$

We can relate these eigenvalues to the maximum and minimum values of the Rayleigh quotient defined as

$$\frac{\mathbf{x}^\top A \mathbf{x}}{\mathbf{x}^\top \mathbf{x}}, \qquad \mathbf{x} \neq \mathbf{0}.$$

For any $\mathbf{x} \neq \mathbf{0}$, we get

$$\frac{\mathbf{x}^\top A \mathbf{x}}{\mathbf{x}^\top \mathbf{x}} = \mathbf{y}^\top A \mathbf{y}$$

by setting $\mathbf{y} = \mathbf{x}/|\mathbf{x}|$, meaning that we may consider a quadratic form over unit vectors instead to investigate the Rayleigh quotient.

The Rayleigh quotients over a subspace spanned by a subset of eigenvectors are bounded by the minimum and maximum corresponding eigenvalues. This result will be used occasionally throughout the book.




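Lemma 7.1 Consider an $n \times n$ symmetric matrix $A$ and its real spectral decomposition $A = \sum_{i=1}^{n} \lambda_i(A) \mathbf{v}_i \mathbf{v}_i^\top$. Let $1 \le p \le q \le n$. Then, for any non-zero vector $\mathbf{x} \in \operatorname{span}\{\mathbf{v}_p, \mathbf{v}_{p+1}, \ldots, \mathbf{v}_q\}$,

$$\lambda_q(A) \le \frac{\mathbf{x}^\top A \mathbf{x}}{\mathbf{x}^\top \mathbf{x}} \le \lambda_p(A).$$

The upper and lower bounds are achieved by $\mathbf{v}_p$ and $\mathbf{v}_q$, respectively.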


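Proof: With $\mathbf{x} = \sum_{i=p}^{q} x_i \mathbf{v}_i$,

$$\mathbf{x}^\top A \mathbf{x} = \Big(\sum_{i=p}^{q} x_i \mathbf{v}_i^\top\Big) \Big(\sum_{j=1}^{n} \lambda_j(A) \mathbf{v}_j \mathbf{v}_j^\top\Big) \Big(\sum_{k=p}^{q} x_k \mathbf{v}_k\Big) = \sum_{i=p}^{q} \lambda_i(A) x_i^2.$$

Because $\lambda_p(A) \ge \lambda_i(A) \ge \lambda_q(A)$ for $p \le i \le q$ and $\mathbf{x}^\top \mathbf{x} = \sum_{i=p}^{q} x_i^2$,

$$\lambda_q(A)\, \mathbf{x}^\top \mathbf{x} \le \sum_{i=p}^{q} \lambda_i(A) x_i^2 \le \lambda_p(A)\, \mathbf{x}^\top \mathbf{x}.$$

In addition, $\lambda_p(A) = \mathbf{v}_p^\top A \mathbf{v}_p$ and $\lambda_q(A) = \mathbf{v}_q^\top A \mathbf{v}_q$ hold. ∎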


As a special case of this lemma, we get the following result. 


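Theorem 7.1 (Rayleigh quotients) For any $n \times n$ symmetric matrix $A$,

$$\lambda_{\min}(A) \le \frac{\mathbf{x}^\top A \mathbf{x}}{\mathbf{x}^\top \mathbf{x}} \le \lambda_{\max}(A) \quad \text{for all } \mathbf{x} \neq \mathbf{0}.$$

Moreover,

$$\lambda_{\max}(A) = \max_{|\mathbf{x}| = 1} \mathbf{x}^\top A \mathbf{x}, \qquad \lambda_{\min}(A) = \min_{|\mathbf{x}| = 1} \mathbf{x}^\top A \mathbf{x},$$

and the maximum and minimum are attained for $\mathbf{x} = \mathbf{v}_1$ and for $\mathbf{x} = \mathbf{v}_n$, respectively, where $\mathbf{v}_1$ (resp. $\mathbf{v}_n$) is the unit-norm eigenvector of $A$ associated with its largest (resp. smallest) eigenvalue.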


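Proof: By real spectral decomposition, we can express a symmetric matrix $A$ using orthonormal vectors $\mathbf{v}_1, \ldots, \mathbf{v}_n$ as

$$A = \sum_{i=1}^{n} \lambda_i(A) \mathbf{v}_i \mathbf{v}_i^\top.$$

From Lemma 7.1, we get

$$\lambda_1(A)\, \mathbf{x}^\top \mathbf{x} \ge \mathbf{x}^\top A \mathbf{x} \ge \lambda_n(A)\, \mathbf{x}^\top \mathbf{x}$$

when $p = 1$ and $q = n$. From this, we see that $\lambda_1(A)$ and $\lambda_n(A)$ are the maximum and minimum of the Rayleigh quotients, respectively. ∎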


We can generalize the results in Lemma 7.1 and Theorem 7.1 to derive bounds for the maximum/minimum values of the Rayleigh quotient over a $k$-dimensional subspace.




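Lemma 7.2 (Poincaré inequality) Consider an $n \times n$ symmetric matrix $A$ and a $k$-dimensional subspace $W$ of $\mathbb{R}^n$. Then,

$$\min_{\substack{\mathbf{x} \in W \\ \mathbf{x} \neq \mathbf{0}}} \frac{\mathbf{x}^\top A \mathbf{x}}{\mathbf{x}^\top \mathbf{x}} \le \lambda_k(A), \qquad \max_{\substack{\mathbf{y} \in W \\ \mathbf{y} \neq \mathbf{0}}} \frac{\mathbf{y}^\top A \mathbf{y}}{\mathbf{y}^\top \mathbf{y}} \ge \lambda_{n-k+1}(A).$$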


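Proof: Let the real spectral decomposition of $A$ be $A = \sum_{i=1}^{n} \lambda_i(A) \mathbf{v}_i \mathbf{v}_i^\top$, and let $W' = \operatorname{span}\{\mathbf{v}_k, \ldots, \mathbf{v}_n\}$. If $W \cap W' = \{\mathbf{0}\}$ held, we would have $\dim \operatorname{span}(W \cup W') = k + (n - k + 1) = n + 1$, because $\dim(W') = n - k + 1$, which is a contradiction. Thus, there must be $\mathbf{x} \neq \mathbf{0}$ in $W \cap W'$. According to Lemma 7.1, $\frac{\mathbf{x}^\top A \mathbf{x}}{\mathbf{x}^\top \mathbf{x}} \le \lambda_k(A)$, because $\mathbf{x} \in W'$. Since $\mathbf{x}$ is also in $W$, we get the first inequality. We can prove the second inequality following the same steps starting from $W' = \operatorname{span}\{\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_{n-k+1}\}$. ∎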

Combining the results above, we derive the following useful result showing that every eigenvalue of a 


symmetric matrix can be expressed as minimax or maxmin of the Rayleigh quotient. 


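Theorem 7.2 (Minimax Principle) Consider an $n \times n$ symmetric matrix $A$. Then, for $1 \le k \le n$,

$$\lambda_k(A) = \max_{W : \dim W = k}\ \min_{\substack{\mathbf{x} \in W \\ \mathbf{x} \neq \mathbf{0}}} \frac{\mathbf{x}^\top A \mathbf{x}}{\mathbf{x}^\top \mathbf{x}} \tag{7.1}$$

$$\phantom{\lambda_k(A)} = \min_{W : \dim W = n-k+1}\ \max_{\substack{\mathbf{x} \in W \\ \mathbf{x} \neq \mathbf{0}}} \frac{\mathbf{x}^\top A \mathbf{x}}{\mathbf{x}^\top \mathbf{x}}. \tag{7.2}$$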

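Proof: Let the real spectral decomposition of $A$ be $A = \sum_{i=1}^{n} \lambda_i(A) \mathbf{v}_i \mathbf{v}_i^\top$. By the first inequality in Lemma 7.2,

$$\min_{\substack{\mathbf{x} \in W \\ \mathbf{x} \neq \mathbf{0}}} \frac{\mathbf{x}^\top A \mathbf{x}}{\mathbf{x}^\top \mathbf{x}} \le \lambda_k(A)$$

for any subspace $W$ with $\dim W = k$. In order to prove (7.1), we then need to find a $k$-dimensional subspace where the minimum Rayleigh quotient is $\lambda_k(A)$. When $W^* = \operatorname{span}\{\mathbf{v}_1, \ldots, \mathbf{v}_k\}$, $\frac{\mathbf{x}^\top A \mathbf{x}}{\mathbf{x}^\top \mathbf{x}} \ge \lambda_k(A)$ for $\mathbf{0} \neq \mathbf{x} \in W^*$, according to Lemma 7.1. This implies $\min_{\mathbf{0} \neq \mathbf{x} \in W^*} \frac{\mathbf{x}^\top A \mathbf{x}}{\mathbf{x}^\top \mathbf{x}} = \lambda_k(A)$, as $\mathbf{v}_k^\top A \mathbf{v}_k = \lambda_k(A)$. In other words, because the minimum Rayleigh quotient in the $k$-dimensional subspace $W^*$ is $\lambda_k(A)$, it holds that

$$\lambda_k(A) = \max_{W : \dim W = k}\ \min_{\substack{\mathbf{x} \in W \\ \mathbf{x} \neq \mathbf{0}}} \frac{\mathbf{x}^\top A \mathbf{x}}{\mathbf{x}^\top \mathbf{x}}.$$

We can prove the second equality using the remaining inequality from Lemma 7.2.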


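From Theorem 7.2, we now know that

$$\lambda_{\max}(A) = \lambda_1(A) = \max_{|\mathbf{x}| = 1} \mathbf{x}^\top A \mathbf{x}, \qquad \lambda_{\min}(A) = \lambda_n(A) = \min_{|\mathbf{x}| = 1} \mathbf{x}^\top A \mathbf{x}. \tag{7.3}$$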
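The bounds in (7.3) can be checked numerically on a small case. The sketch below is only an illustration under stated assumptions: it is restricted to a $2 \times 2$ symmetric matrix so that the eigenvalues have a closed form, and the helper names `rayleigh` and `sym2_eigenvalues` as well as the example matrix are introduced here.

```python
import math

def rayleigh(A, x):
    """Rayleigh quotient x^T A x / x^T x for a 2 x 2 matrix A."""
    Ax = [A[0][0] * x[0] + A[0][1] * x[1], A[1][0] * x[0] + A[1][1] * x[1]]
    return (x[0] * Ax[0] + x[1] * Ax[1]) / (x[0] ** 2 + x[1] ** 2)

def sym2_eigenvalues(A):
    """Eigenvalues (largest first) of a symmetric 2 x 2 matrix, in closed form."""
    a, b, d = A[0][0], A[0][1], A[1][1]
    mean, radius = (a + d) / 2.0, math.hypot((a - d) / 2.0, b)
    return mean + radius, mean - radius

A = [[2.0, 1.0], [1.0, 2.0]]
lam_max, lam_min = sym2_eigenvalues(A)   # eigenvalues 3 and 1

# Sample unit vectors on the circle (one per degree) and record the quotients.
angles = [2.0 * math.pi * i / 360 for i in range(360)]
quotients = [rayleigh(A, (math.cos(t), math.sin(t))) for t in angles]
```

Every sampled quotient lies between `lam_min` and `lam_max`, and the extremes are reached along the eigenvector directions $(1, 1)$ and $(1, -1)$.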




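By replacing $A$ with $A^\top A$, we find the following relationship with the spectral norm of $A$:

$$\sqrt{\lambda_1(A^\top A)} = \max_{|\mathbf{x}| = 1} |A\mathbf{x}| = \|A\|_2, \qquad \sqrt{\lambda_n(A^\top A)} = \min_{|\mathbf{x}| = 1} |A\mathbf{x}|.$$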


7.4.1 Eigenvalues and Singular Values of a Matrix Sum 
We can derive the following result on the eigenvalues of a matrix sum, called Weyl's inequality, from the minimax principle above.


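Theorem 7.3 (Weyl's inequality of eigenvalues) Let $A$ and $B$ be $n \times n$ symmetric matrices with eigenvalues $\lambda_i(A)$ and $\lambda_i(B)$, respectively. Then

$$\lambda_{k+\ell+1}(A + B) \le \lambda_{k+1}(A) + \lambda_{\ell+1}(B)$$

for $k, \ell \ge 0$ with $k + \ell + 1 \le n$.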


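Proof: Let us bound $\lambda_{k+\ell+1}(A + B)$ in terms of the eigenvalues of $A$ and $B$. By the minimax principle (Theorem 7.2), we have

$$\lambda_{k+\ell+1}(A + B) = \min_{W : \dim W = n-k-\ell}\ \max_{\substack{\mathbf{x} \in W \\ \mathbf{x} \neq \mathbf{0}}} \frac{\mathbf{x}^\top (A + B) \mathbf{x}}{\mathbf{x}^\top \mathbf{x}}.$$

Again by the minimax principle, we can find subspaces $W_A$ and $W_B$ of $\mathbb{R}^n$ of dimensions $n - k$ and $n - \ell$, respectively, such that

$$\lambda_{k+1}(A) = \max_{\substack{\mathbf{x} \in W_A \\ \mathbf{x} \neq \mathbf{0}}} \frac{\mathbf{x}^\top A \mathbf{x}}{\mathbf{x}^\top \mathbf{x}} \quad \text{and} \quad \lambda_{\ell+1}(B) = \max_{\substack{\mathbf{x} \in W_B \\ \mathbf{x} \neq \mathbf{0}}} \frac{\mathbf{x}^\top B \mathbf{x}}{\mathbf{x}^\top \mathbf{x}}.$$

If we let $W_1 = W_A \cap W_B$ be their intersection, then it is clear that

$$\max_{\substack{\mathbf{x} \in W_1 \\ \mathbf{x} \neq \mathbf{0}}} \frac{\mathbf{x}^\top A \mathbf{x}}{\mathbf{x}^\top \mathbf{x}} \le \lambda_{k+1}(A) \quad \text{and} \quad \max_{\substack{\mathbf{x} \in W_1 \\ \mathbf{x} \neq \mathbf{0}}} \frac{\mathbf{x}^\top B \mathbf{x}}{\mathbf{x}^\top \mathbf{x}} \le \lambda_{\ell+1}(B).$$

This intersection $W_1$ has dimension at least $n - k - \ell$. Let $W_2$ be any $(n - k - \ell)$-dimensional subspace of $W_1$. We then have

$$\begin{aligned}
\lambda_{k+\ell+1}(A + B) &\le \max_{\mathbf{x} \in W_2,\, |\mathbf{x}| = 1} \mathbf{x}^\top (A + B) \mathbf{x} \\
&\le \max_{\mathbf{x} \in W_1,\, |\mathbf{x}| = 1} \mathbf{x}^\top (A + B) \mathbf{x} \qquad (\text{since } W_2 \subset W_1) \\
&\le \max_{\mathbf{x} \in W_1,\, |\mathbf{x}| = 1} \mathbf{x}^\top A \mathbf{x} + \max_{\mathbf{x} \in W_1,\, |\mathbf{x}| = 1} \mathbf{x}^\top B \mathbf{x} \\
&\le \lambda_{k+1}(A) + \lambda_{\ell+1}(B).
\end{aligned}$$

∎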
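As a quick numerical sanity check of Theorem 7.3, consider the following $2 \times 2$ example. The matrices are illustrative choices made here, not taken from the text.

```latex
A = \begin{pmatrix} 2 & 0 \\ 0 & 1 \end{pmatrix}, \qquad
B = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}, \qquad
A + B = \begin{pmatrix} 2 & 1 \\ 1 & 1 \end{pmatrix},
\\[4pt]
\lambda(A) = (2,\, 1), \qquad \lambda(B) = (1,\, -1), \qquad
\lambda(A + B) = \left( \tfrac{3 + \sqrt{5}}{2},\ \tfrac{3 - \sqrt{5}}{2} \right) \approx (2.618,\ 0.382),
\\[4pt]
\begin{aligned}
(k, \ell) = (0, 0):&\quad \lambda_1(A + B) \approx 2.618 \le \lambda_1(A) + \lambda_1(B) = 3, \\
(k, \ell) = (1, 0):&\quad \lambda_2(A + B) \approx 0.382 \le \lambda_2(A) + \lambda_1(B) = 2, \\
(k, \ell) = (0, 1):&\quad \lambda_2(A + B) \approx 0.382 \le \lambda_1(A) + \lambda_2(B) = 1.
\end{aligned}
```

All three admissible index pairs satisfy the inequality, and the last one shows that the bound can be fairly tight.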
By symmetrization (5.15), we can turn a result on the eigenvalues of a symmetric matrix into one on the singular values of an arbitrary matrix. Along this line, we can rewrite Theorem 7.3 in terms of the singular values of an arbitrary matrix. For notational convenience, $\sigma_k(A)$ denotes the $k$-th largest singular value of $A$.


Theorem 7.4 (Weyl's inequality of singular values) Let $A$ and $B$ be matrices of the same size with singular values $\sigma_i(A)$ and $\sigma_i(B)$, respectively. Then, for $k, \ell = 0, 1, 2, \ldots$,

$$\sigma_{k+\ell+1}(A + B) \le \sigma_{k+1}(A) + \sigma_{\ell+1}(B).$$


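Proof: According to Lemma 5.4, $\sigma_i(A) = \lambda_i(s(A))$, $\sigma_i(B) = \lambda_i(s(B))$ and $\sigma_i(A + B) = \lambda_i(s(A + B))$. Because symmetrization is linear and produces symmetric matrices, we get

$$\sigma_{k+\ell+1}(A + B) = \lambda_{k+\ell+1}(s(A + B)) = \lambda_{k+\ell+1}(s(A) + s(B)) \le \lambda_{k+1}(s(A)) + \lambda_{\ell+1}(s(B)) = \sigma_{k+1}(A) + \sigma_{\ell+1}(B),$$

according to Theorem 7.3. ∎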
If the $(k+1)$-th and $(\ell+1)$-th singular values of $A$ and $B$, respectively, are very small, we may regard their ranks to be $k$ and $\ell$, respectively. From the above result, we know that the $(k + \ell + 1)$-th singular value of their sum $A + B$ must also be very small, and consequently, the rank of $A + B$ might be regarded as at most $k + \ell$.
When $\ell = 0$ in Theorem 7.4, the inequality reduces to $\sigma_{k+1}(A + B) \le \sigma_{k+1}(A) + \sigma_1(B)$. From this, we get the following observation.


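Fact 7.6 (Bound of additive perturbation) Let $A$ and $B$ be $m \times n$ matrices. Then, for $1 \le k \le \min\{m, n\}$,

$$|\sigma_k(A) - \sigma_k(B)| \le \sigma_1(A - B) = \|A - B\|_2.$$

Proof: When $\ell = 0$ in Theorem 7.4,

$$\sigma_k(A) \le \sigma_k(B) + \sigma_1(A - B), \qquad \sigma_k(B) \le \sigma_k(A) + \sigma_1(B - A),$$

with appropriate choices of the two matrices in the theorem. Combining these with $\sigma_1(\cdot) = \|\cdot\|_2$ and $\sigma_1(A - B) = \sigma_1(B - A)$, we prove the statement. ∎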
The following inequality involving an eigenvalue of the sum of two symmetric matrices comes in handy later.

Fact 7.7 Let $A$ and $B$ be $n \times n$ symmetric matrices. Then, for $1 \le k \le n$,

$$\lambda_k(A) + \lambda_{\min}(B) \le \lambda_k(A + B) \le \lambda_k(A) + \lambda_{\max}(B).$$


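Proof: By (7.3), it holds that $\lambda_n(B) \le \frac{\mathbf{x}^\top B \mathbf{x}}{\mathbf{x}^\top \mathbf{x}} \le \lambda_1(B)$ for all $\mathbf{x} \neq \mathbf{0}$. Therefore, for all $\mathbf{x} \neq \mathbf{0}$, the following also holds:

$$\frac{\mathbf{x}^\top A \mathbf{x}}{\mathbf{x}^\top \mathbf{x}} + \lambda_n(B) \le \frac{\mathbf{x}^\top A \mathbf{x}}{\mathbf{x}^\top \mathbf{x}} + \frac{\mathbf{x}^\top B \mathbf{x}}{\mathbf{x}^\top \mathbf{x}} = \frac{\mathbf{x}^\top (A + B) \mathbf{x}}{\mathbf{x}^\top \mathbf{x}} \le \frac{\mathbf{x}^\top A \mathbf{x}}{\mathbf{x}^\top \mathbf{x}} + \lambda_1(B),$$

from which we get

$$\max_{\dim W = k}\ \min_{\substack{\mathbf{x} \in W \\ \mathbf{x} \neq \mathbf{0}}} \frac{\mathbf{x}^\top A \mathbf{x}}{\mathbf{x}^\top \mathbf{x}} + \lambda_n(B) \le \max_{\dim W = k}\ \min_{\substack{\mathbf{x} \in W \\ \mathbf{x} \neq \mathbf{0}}} \frac{\mathbf{x}^\top (A + B) \mathbf{x}}{\mathbf{x}^\top \mathbf{x}} \le \max_{\dim W = k}\ \min_{\substack{\mathbf{x} \in W \\ \mathbf{x} \neq \mathbf{0}}} \frac{\mathbf{x}^\top A \mathbf{x}}{\mathbf{x}^\top \mathbf{x}} + \lambda_1(B).$$

We derive the desired inequalities using Theorem 7.2. ∎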
Note that we proved both inequalities at the same time, rather than using Theorem 7.3 to prove the 
second inequality, which would have required a separate proof for the first inequality. 
From this result, we observe that the eigenvalue increases when we add a positive (semi-)definite 


matrix to a symmetric matrix. 

Fact 7.8 Let $A$ and $B$ be $n \times n$ symmetric matrices. Let $1 \le k \le n$. If $B$ is positive semi-definite, then

$$\lambda_k(A) \le \lambda_k(A + B).$$

If $B$ is positive definite, then

$$\lambda_k(A) < \lambda_k(A + B).$$

If we add a symmetric positive semi-definite matrix of rank one to a symmetric matrix, we can further obtain an upper bound on the eigenvalues as follows:

$$\lambda_k(A) \le \lambda_k(A + \mathbf{q}\mathbf{q}^\top) \le \lambda_k(A) + |\mathbf{q}|^2$$

for all $1 \le k \le n$.


Proof: Since $B$ is positive semi-definite, $\lambda_{\min}(B) \ge 0$. Hence, Fact 7.7 implies the first result. If $B$ is positive definite, $\lambda_{\min}(B) > 0$ implies the second result. A symmetric positive semi-definite matrix of rank one is always of the form $\mathbf{q}\mathbf{q}^\top$ for a non-zero vector $\mathbf{q}$. It is easy to see that $\lambda_{\min}(\mathbf{q}\mathbf{q}^\top) = 0$ and $\lambda_{\max}(\mathbf{q}\mathbf{q}^\top) = |\mathbf{q}|^2$. Fact 7.7 then implies the third result. ∎

We arrive at the following result on eigenvalue interlacing by systematically analyzing how eigenvalues 


change when a rank-one positive semi-definite matrix is added to a symmetric matrix. 


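Theorem 7.5 (Eigenvalue Interlacing) Let $A$ be an $n \times n$ symmetric matrix and $B$ an $n \times n$ symmetric positive semi-definite matrix of rank one. Then,

$$\lambda_{k+1}(A + B) \le \lambda_k(A) \le \lambda_k(A + B), \quad \text{for all } k = 1, \ldots, n-1,$$

$$\lambda_{k+1}(A) \le \lambda_k(A - B) \le \lambda_k(A), \quad \text{for all } k = 1, \ldots, n-1.$$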


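Proof: Given appropriate orthonormal vector sets, $\{\mathbf{v}_1, \ldots, \mathbf{v}_n\}$ and $\{\mathbf{u}_1, \ldots, \mathbf{u}_n\}$, we can write $A + B$ and $A$ as

$$A + B = \sum_{i=1}^{n} \lambda_i(A + B) \mathbf{v}_i \mathbf{v}_i^\top \quad \text{and} \quad A = \sum_{i=1}^{n} \lambda_i(A) \mathbf{u}_i \mathbf{u}_i^\top$$

using real spectral decomposition. Consider the following three subspaces of $\mathbb{R}^n$: $\mathcal{U} = \operatorname{span}\{\mathbf{u}_k, \ldots, \mathbf{u}_n\}$, $\mathcal{V} = \operatorname{span}\{\mathbf{v}_1, \ldots, \mathbf{v}_{k+1}\}$, and $\mathcal{W} = \operatorname{null} B$. Because $\dim \mathcal{U} + \dim \mathcal{V} = n + 2$, $\dim(\mathcal{U} \cap \mathcal{V}) \ge 2$. Furthermore, because $\dim \mathcal{W} = n - \operatorname{rank}(B) = n - 1$, $\dim(\mathcal{U} \cap \mathcal{V} \cap \mathcal{W}) \ge 1$. Let $\hat{\mathbf{x}}$ be a unit vector in $\mathcal{U} \cap \mathcal{V} \cap \mathcal{W}$. Since $B\hat{\mathbf{x}} = \mathbf{0}$, $\hat{\mathbf{x}} \in \mathcal{V}$, and $\hat{\mathbf{x}} \in \mathcal{U}$,

$$\begin{aligned}
\lambda_{k+1}(A + B) &\le \hat{\mathbf{x}}^\top (A + B) \hat{\mathbf{x}} && \text{by Lemma 7.1 and } \hat{\mathbf{x}} \in \mathcal{V} \\
&= \hat{\mathbf{x}}^\top A \hat{\mathbf{x}} && \text{by } B\hat{\mathbf{x}} = \mathbf{0} \\
&\le \lambda_k(A) && \text{by Lemma 7.1 and } \hat{\mathbf{x}} \in \mathcal{U}.
\end{aligned}$$

Combining this with Fact 7.8, we obtain the two inequalities in the first line. We can prove the inequalities in the second line by replacing $A$ with $A - B$ in the proof here. ∎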
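Fact 7.8 and Theorem 7.5 can be checked numerically on a small case. The sketch below is restricted to $2 \times 2$ symmetric matrices so that eigenvalues are available in closed form; the helper `sym2_eigenvalues`, the matrix entries, and the vector `q` are assumptions chosen for illustration.

```python
import math

def sym2_eigenvalues(a, b, d):
    """Eigenvalues (largest first) of the symmetric matrix [[a, b], [b, d]]."""
    mean, radius = (a + d) / 2.0, math.hypot((a - d) / 2.0, b)
    return mean + radius, mean - radius

# A = [[3, 1], [1, 1]] is symmetric; B = q q^T is a rank-one PSD update.
a, b, d = 3.0, 1.0, 1.0
q = (1.0, 2.0)
lam_A = sym2_eigenvalues(a, b, d)
lam_AB = sym2_eigenvalues(a + q[0] * q[0], b + q[0] * q[1], d + q[1] * q[1])
q_norm_sq = q[0] ** 2 + q[1] ** 2

# Fact 7.8: lambda_k(A) <= lambda_k(A + q q^T) <= lambda_k(A) + |q|^2 for each k.
fact_7_8 = all(lam_A[k] <= lam_AB[k] <= lam_A[k] + q_norm_sq for k in range(2))

# Theorem 7.5 with k = 1: lambda_2(A + B) <= lambda_1(A) <= lambda_1(A + B).
interlacing = lam_AB[1] <= lam_A[0] <= lam_AB[0]
```

Here the eigenvalues of $A$ are roughly 3.41 and 0.59, those of $A + \mathbf{q}\mathbf{q}^\top$ roughly 7.54 and 1.46, so both flags evaluate to `True`.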


7.5 Ellipsoidal Geometry of Positive Definite Matrices 


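A geometric interpretation of an $n \times n$ symmetric positive definite matrix $A$ can be related to an ellipsoid in $\mathbb{R}^n$. Consider the following ellipsoidal set based on a quadratic inequality:

$$\mathcal{E} = \{\mathbf{x} \in \mathbb{R}^n : \mathbf{x}^\top A^{-1} \mathbf{x} \le 1\}.$$

Since we can easily translate this ellipsoid to be centered at $\mathbf{a}$ by $\{\mathbf{x} \in \mathbb{R}^n : (\mathbf{x} - \mathbf{a})^\top A^{-1} (\mathbf{x} - \mathbf{a}) \le 1\}$, we will only consider $\mathcal{E}$. According to Fact 7.3, $A^{-1}$ is also positive definite, and we can write it using a set of orthonormal vectors, $\mathbf{v}_1, \ldots, \mathbf{v}_n$, as

$$A^{-1} = \sum_{i=1}^{n} \frac{1}{\lambda_i(A)} \mathbf{v}_i \mathbf{v}_i^\top.$$

We can write an arbitrary vector $\mathbf{x} \in \mathbb{R}^n$ as $\mathbf{x} = \sum_{i=1}^{n} y_i \mathbf{v}_i$ with $y_i = \langle \mathbf{x}, \mathbf{v}_i \rangle$. From this we get

$$\mathbf{x}^\top A^{-1} \mathbf{x} = \sum_{i=1}^{n} \frac{y_i^2}{\lambda_i(A)}.$$

We can then represent the ellipsoid in terms of $\{\mathbf{v}_1, \ldots, \mathbf{v}_n\}$ as

$$\mathcal{E} = \Big\{ \sum_{i=1}^{n} y_i \mathbf{v}_i : \sum_{i=1}^{n} \frac{y_i^2}{\lambda_i(A)} \le 1 \Big\}.$$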


Let us convert this expression into a more familiar form in which each axis of the ellipsoid coincides with a standard basis vector. We start from $\mathcal{E}' = \{\mathbf{y} \in \mathbb{R}^n : \sum_{i=1}^{n} y_i^2/\lambda_i(A) \le 1\}$ and linearly transform $\mathbf{e}_i$ to $\mathbf{v}_i$, in order to obtain $\mathcal{E}$. The $i$-th longest axis of this ellipsoid has length $2\sqrt{\lambda_i(A)}$ and direction $\mathbf{v}_i$. Since both bases are orthonormal, it holds intuitively that the two $n$-dimensional ellipsoids have equal volumes, that is, $\operatorname{vol}(\mathcal{E}) = \operatorname{vol}(\mathcal{E}')$. Let $\mathcal{B}_n = \{\mathbf{x} \in \mathbb{R}^n : \mathbf{x}^\top \mathbf{x} \le 1\}$ be the $n$-dimensional unit ball. We can transform this unit ball $\mathcal{B}_n$ into the ellipsoid $\mathcal{E}'$ by linearly transforming $\mathbf{e}_i$ to $\sqrt{\lambda_i(A)}\,\mathbf{e}_i$. Since the length along each orthogonal direction $\mathbf{e}_i$ grows from 1 to $\sqrt{\lambda_i(A)}$, the volume of $\mathcal{E}'$ is $\prod_{i=1}^{n} \sqrt{\lambda_i(A)}$ times that of the unit ball. That is,

$$\operatorname{vol}(\mathcal{E}) = \operatorname{vol}(\mathcal{E}') = \prod_{i=1}^{n} \sqrt{\lambda_i(A)}\ \operatorname{vol}(\mathcal{B}_n). \tag{7.5}$$
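The volume relation (7.5) is easy to evaluate once the eigenvalues of $A$ are known. The sketch below assumes the standard closed form $\pi^{n/2}/\Gamma(n/2 + 1)$ for the volume of the unit ball, which is not stated in the text; the function names and the example eigenvalues are also illustrative.

```python
import math

def unit_ball_volume(n):
    """Volume of the n-dimensional unit ball, pi^(n/2) / Gamma(n/2 + 1)."""
    return math.pi ** (n / 2.0) / math.gamma(n / 2.0 + 1.0)

def ellipsoid_volume(eigenvalues):
    """Volume of {x : x^T A^{-1} x <= 1} from the eigenvalues of A, following (7.5)."""
    scale = math.prod(math.sqrt(lam) for lam in eigenvalues)
    return scale * unit_ball_volume(len(eigenvalues))

# A = diag(4, 1): the semi-axes are 2 and 1, so the area should be pi * 2 * 1.
area = ellipsoid_volume([4.0, 1.0])
```

In two dimensions this reproduces the familiar ellipse area $\pi \cdot 2 \cdot 1$.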


Let us consider a related concept called the Mahalanobis distance. Given a positive definite matrix $A$, both $\mathbf{x}^\top A \mathbf{y}$ and $\mathbf{x}^\top A^{-1} \mathbf{y}$ are inner products, according to Theorem 4.1. The latter inner product induces a norm $f(\mathbf{x}) = \sqrt{\mathbf{x}^\top A^{-1} \mathbf{x}}$, and further we can view $f(\mathbf{x} - \mathbf{y})$ as a distance between $\mathbf{x}$ and $\mathbf{y}$, as in

$$d(\mathbf{x}, \mathbf{y}) = f(\mathbf{x} - \mathbf{y}) = \sqrt{(\mathbf{x} - \mathbf{y})^\top A^{-1} (\mathbf{x} - \mathbf{y})}.$$

We call such a distance the Mahalanobis distance with respect to the positive definite matrix $A$. From the perspective of data science, this is a distance between two vectors that takes into account the covariance of a data distribution. An ellipsoid from above can be thought of as the set of all vectors whose Mahalanobis distances to a center vector are less than or equal to 1.
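To make the definition concrete, here is a minimal sketch of the Mahalanobis distance for the $2 \times 2$ case, where $A^{-1}$ can be written out explicitly. The function name `mahalanobis_2d` and the example matrix are assumptions introduced for illustration.

```python
import math

def mahalanobis_2d(x, y, A):
    """sqrt((x - y)^T A^{-1} (x - y)) for a 2 x 2 symmetric positive definite A."""
    (a, b), (c, d) = A
    det = a * d - b * c
    if det <= 0.0 or a <= 0.0:
        raise ValueError("A must be positive definite")
    # Explicit inverse of a 2 x 2 matrix.
    inv = [[d / det, -b / det], [-c / det, a / det]]
    u = (x[0] - y[0], x[1] - y[1])
    quad = (u[0] * (inv[0][0] * u[0] + inv[0][1] * u[1])
            + u[1] * (inv[1][0] * u[0] + inv[1][1] * u[1]))
    return math.sqrt(quad)

A = [[4.0, 0.0], [0.0, 1.0]]   # think of variance 4 along e1 and 1 along e2
origin = (0.0, 0.0)
d1 = mahalanobis_2d((2.0, 0.0), origin, A)   # point on the ellipsoid boundary
d2 = mahalanobis_2d((0.0, 2.0), origin, A)   # same Euclidean length, but outside
```

The two points have the same Euclidean distance to the origin, yet `d1` is 1 and `d2` is 2, because the second point lies along the direction with the smaller eigenvalue of $A$.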


7.6 Application to Data Science: a Kernel Trick in Machine Learning


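We base our discussion here on [3].
Consider a particular instance of classification, where our goal is to find a linear function³ that classifies data points into two groups. There are $n$ data points:

$$\{(\mathbf{z}_i, \ell_i) \in \mathbb{R}^N \times \{+1, -1\} : i = 1, \ldots, n\}.$$

We use $\mathcal{I}_+ = \{i : \ell_i = +1\}$ and $\mathcal{I}_- = \{i : \ell_i = -1\}$ as the index sets of the positive and negative examples, respectively. For each group, we use $n_+ = |\mathcal{I}_+|$ ($n_- = |\mathcal{I}_-|$) as its size and $\boldsymbol{\mu}_+ = \frac{1}{n_+} \sum_{i \in \mathcal{I}_+} \mathbf{z}_i$ ($\boldsymbol{\mu}_- = \frac{1}{n_-} \sum_{i \in \mathcal{I}_-} \mathbf{z}_i$) as its centroid.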

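A new observation $\mathbf{z}$ is classified based on whether the vector from the midpoint of the group means, $\boldsymbol{\mu} = \frac{1}{2}(\boldsymbol{\mu}_+ + \boldsymbol{\mu}_-)$, to $\mathbf{z}$ points in the same direction as the vector from the negative group mean to the positive group mean. That is,

$$\ell = \operatorname{sign} \langle \mathbf{z} - \boldsymbol{\mu},\ \boldsymbol{\mu}_+ - \boldsymbol{\mu}_- \rangle.$$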


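If we expand this classification rule further, we see that it is expressed as a sum of inner products against the data points, as in

$$\begin{aligned}
\langle \mathbf{z} - \boldsymbol{\mu},\ \boldsymbol{\mu}_+ - \boldsymbol{\mu}_- \rangle
&= \langle \mathbf{z},\ \boldsymbol{\mu}_+ - \boldsymbol{\mu}_- \rangle - \tfrac{1}{2} \langle \boldsymbol{\mu}_+ + \boldsymbol{\mu}_-,\ \boldsymbol{\mu}_+ - \boldsymbol{\mu}_- \rangle \\
&= \langle \mathbf{z}, \boldsymbol{\mu}_+ \rangle - \langle \mathbf{z}, \boldsymbol{\mu}_- \rangle - \tfrac{1}{2} \big( \langle \boldsymbol{\mu}_+, \boldsymbol{\mu}_+ \rangle - \langle \boldsymbol{\mu}_-, \boldsymbol{\mu}_- \rangle \big) \\
&= \frac{1}{n_+} \sum_{i \in \mathcal{I}_+} \langle \mathbf{z}, \mathbf{z}_i \rangle - \frac{1}{n_-} \sum_{i \in \mathcal{I}_-} \langle \mathbf{z}, \mathbf{z}_i \rangle - b,
\end{aligned}$$

where $b = \frac{1}{2} \big( \langle \boldsymbol{\mu}_+, \boldsymbol{\mu}_+ \rangle - \langle \boldsymbol{\mu}_-, \boldsymbol{\mu}_- \rangle \big) = \frac{1}{2} \big( |\boldsymbol{\mu}_+|^2 - |\boldsymbol{\mu}_-|^2 \big)$ is a constant.

³ In practice, we use an affine function by adding a constant term to a linear function.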
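The expanded rule uses the data only through inner products, which the following sketch makes explicit. It is a toy illustration: the helper names `dot`, `mean`, and `classify`, the data points, and the convention of returning $+1$ on a tie are all assumptions made here.

```python
def dot(u, v):
    """Standard inner product of two equal-length vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))

def mean(points):
    """Centroid of a non-empty list of equal-length vectors."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]

def classify(z, positives, negatives):
    """Sign of <z - mu, mu_plus - mu_minus>, computed from inner products with the data."""
    mu_p, mu_n = mean(positives), mean(negatives)
    b = 0.5 * (dot(mu_p, mu_p) - dot(mu_n, mu_n))
    score = (sum(dot(z, zi) for zi in positives) / len(positives)
             - sum(dot(z, zi) for zi in negatives) / len(negatives) - b)
    return 1 if score >= 0 else -1   # ties are sent to +1 by convention

positives = [(2.0, 2.0), (3.0, 1.0)]
negatives = [(-1.0, 0.0), (0.0, -2.0)]
labels = [classify(z, positives, negatives) for z in [(2.0, 1.0), (-1.0, -1.0)]]
```

With these points the first query is labeled $+1$ and the second $-1$, matching which centroid each one is closer to.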

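We use a kernel trick to handle data points that are not linearly separable. This is especially useful when the data points in the original data $\{(\mathbf{x}_i, \ell_i) \in \mathbb{R}^d \times \{+1, -1\} : i = 1, \ldots, n\}$ cannot be linearly separated into positive and negative classes, but become linearly separable after they are embedded into a higher-dimensional space, called a feature space, using a nonlinear transformation $\psi : \mathbb{R}^d \to \mathbb{R}^N$. $N$ is often greater than $d$ and can even be infinitely large. Then, with $\mathbf{z}_i = \psi(\mathbf{x}_i)$, the classification rule in this feature space becomes

$$\ell = \operatorname{sign} \Big( \frac{1}{n_+} \sum_{i \in \mathcal{I}_+} \langle \psi(\mathbf{x}), \psi(\mathbf{x}_i) \rangle - \frac{1}{n_-} \sum_{i \in \mathcal{I}_-} \langle \psi(\mathbf{x}), \psi(\mathbf{x}_i) \rangle - b \Big),$$

and $b$ is also similarly defined using $\langle \psi(\mathbf{x}_i), \psi(\mathbf{x}_j) \rangle$. We call the inner product between the embedded vectors a kernel and use $K(\cdot, \cdot)$ to refer to it:

$$K(\mathbf{x}, \mathbf{y}) = \langle \psi(\mathbf{x}), \psi(\mathbf{y}) \rangle. \tag{7.6}$$

This helps us simplify the classification rule into

$$\ell = \operatorname{sign} \Big( \frac{1}{n_+} \sum_{i \in \mathcal{I}_+} K(\mathbf{x}, \mathbf{x}_i) - \frac{1}{n_-} \sum_{i \in \mathcal{I}_-} K(\mathbf{x}, \mathbf{x}_i) - b \Big).$$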


This kernel-based perspective is useful when it is computationally more favourable to compute $K(\cdot, \cdot)$ directly than to compute the inner product of two vectors after embedding them using $\psi(\mathbf{x})$. We consider two examples below.
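The following sketch shows the kernelized rule on XOR-style data, which no linear rule on the raw inputs can separate. It is a hedged illustration: the function names, the data, and the choice of the inhomogeneous polynomial kernel $(1 + \mathbf{x}^\top \mathbf{y})^2$ (discussed later in this section) are assumptions made here, and $b$ is computed by expanding $|\boldsymbol{\mu}_\pm|^2$ into kernel evaluations.

```python
def poly_kernel(x, y, degree=2):
    """Inhomogeneous polynomial kernel (1 + x^T y)^degree."""
    return (1.0 + sum(xi * yi for xi, yi in zip(x, y))) ** degree

def kernel_classify(x, positives, negatives, K):
    """Mean-based rule in feature space, evaluated through the kernel K only."""
    n_p, n_n = len(positives), len(negatives)
    # b = (|mu_+|^2 - |mu_-|^2) / 2, with both squared norms written as kernel sums.
    b = 0.5 * (sum(K(u, v) for u in positives for v in positives) / n_p ** 2
               - sum(K(u, v) for u in negatives for v in negatives) / n_n ** 2)
    score = (sum(K(x, xi) for xi in positives) / n_p
             - sum(K(x, xi) for xi in negatives) / n_n - b)
    return 1 if score >= 0 else -1   # ties are sent to +1 by convention

# XOR-like data: both class means are the origin, so the linear rule is uninformative.
positives = [(1.0, 1.0), (-1.0, -1.0)]
negatives = [(1.0, -1.0), (-1.0, 1.0)]
labels = [kernel_classify(z, positives, negatives, poly_kernel)
          for z in [(2.0, 2.0), (2.0, -2.0)]]
```

The embedding $\psi$ is never formed explicitly; only kernel values between the query and the training points are needed.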

A kernel in (7.6) is symmetric, i.e., $K(\mathbf{x}, \mathbf{y}) = K(\mathbf{y}, \mathbf{x})$, because it is an inner product, and a matrix constructed from $K(\mathbf{x}_i, \mathbf{x}_j)$ is positive (semi-)definite, according to Theorem 4.1. We thus refer to such a kernel as a symmetric positive definite kernel.

Let us introduce some notation to facilitate analyzing kernels. Similar to the standard inner product in a finite-dimensional Euclidean space, we can define an inner product between two vectors in an infinite-dimensional space as

$$\langle (a_k), (b_k) \rangle = \sum_{k=1}^{\infty} a_k b_k,$$

where $(a_k) = (a_1, a_2, \ldots) \in \mathbb{R}^\infty$ and $(b_k) = (b_1, b_2, \ldots) \in \mathbb{R}^\infty$. We use ; to indicate that we are vertically stacking two vectors. For instance, $(\mathbf{u}; \mathbf{v}) = \begin{pmatrix} \mathbf{u} \\ \mathbf{v} \end{pmatrix}$. We also define $\mathbf{j} = (j_1, j_2, \ldots, j_k)^\top \in \{1, \ldots, n\}^k$ as a vector of natural numbers.


A Polynomial Kernel 


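For two vectors $\mathbf{x}, \mathbf{y} \in \mathbb{R}^n$ and a positive integer $k$, we can expand the $k$-th power of an inner product as

$$\begin{aligned}
(\mathbf{x}^\top \mathbf{y})^k &= \Big( \sum_{j=1}^{n} x_j y_j \Big) \cdots \Big( \sum_{j=1}^{n} x_j y_j \Big) \\
&= \sum_{\mathbf{j} \in \{1, \ldots, n\}^k}\ \prod_{i=1}^{k} x_{j_i} y_{j_i} \\
&= \sum_{\mathbf{j} \in \{1, \ldots, n\}^k} \Big( \prod_{i=1}^{k} x_{j_i} \Big) \Big( \prod_{i=1}^{k} y_{j_i} \Big).
\end{aligned}$$

We denote all $n^k$ vectors in $\{1, \ldots, n\}^k$ by $\mathbf{j}^{(1)}, \ldots, \mathbf{j}^{(n^k)}$, so that $\{\mathbf{j}^{(1)}, \ldots, \mathbf{j}^{(n^k)}\} = \{1, \ldots, n\}^k$. We use $(\mathbf{j}^{(p)})_i$ or $j_i^{(p)}$ to refer to the $i$-th entry of $\mathbf{j}^{(p)}$. Using these notations and re-arranging terms, we get

$$(\mathbf{x}^\top \mathbf{y})^k = \sum_{p=1}^{n^k} \prod_{i=1}^{k} x_{j_i^{(p)}} y_{j_i^{(p)}} = \sum_{p=1}^{n^k} \Big( \prod_{i=1}^{k} x_{j_i^{(p)}} \Big) \Big( \prod_{i=1}^{k} y_{j_i^{(p)}} \Big).$$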


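We introduce the following nonlinear embedding transformation $\psi_k : \mathbb{R}^n \to \mathbb{R}^{n^k}$ to interpret the last summation with $n^k$ summands as a standard inner product:

$$\psi_k(\mathbf{x}) = \Big( \prod_{i=1}^{k} x_{j_i^{(1)}},\ \prod_{i=1}^{k} x_{j_i^{(2)}},\ \ldots,\ \prod_{i=1}^{k} x_{j_i^{(n^k)}} \Big)^\top \in \mathbb{R}^{n^k}. \tag{7.7}$$

This embedding results in a simplified representation:

$$(\mathbf{x}^\top \mathbf{y})^k = \langle \psi_k(\mathbf{x}), \psi_k(\mathbf{y}) \rangle.$$

Because $\psi_k$ depends on the ordering of the index vectors $\mathbf{j}^{(p)}$, which is not uniquely determined, $\psi_k$ is not uniquely determined either.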


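In other words, the $k$-th order polynomial kernel in $\mathbb{R}^n$,

$$K(\mathbf{x}, \mathbf{y}) = (\mathbf{x}^\top \mathbf{y})^k,$$

can be represented by a standard inner product in $\mathbb{R}^{n^k}$,

$$\langle \psi_k(\mathbf{x}), \psi_k(\mathbf{y}) \rangle.$$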
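The identity between the polynomial kernel and the inner product of embedded vectors can be verified directly for small $n$ and $k$. In the sketch below, the function name `psi`, the zero-based indices, and the lexicographic ordering of the index vectors (one of the many possible orderings) are choices made here for illustration.

```python
from itertools import product
from math import prod

def psi(x, k):
    """Embedding psi_k of (7.7): one coordinate per index vector j in {0,...,n-1}^k."""
    n = len(x)
    return [prod(x[ji] for ji in j) for j in product(range(n), repeat=k)]

x, y, k = (1.0, 2.0, -1.0), (0.5, -1.0, 3.0), 3

# Left-hand side: the kernel (x^T y)^k, computed in the original space R^n.
kernel_value = sum(xi * yi for xi, yi in zip(x, y)) ** k

# Right-hand side: the standard inner product in the n^k-dimensional feature space.
embedded_value = sum(a * b for a, b in zip(psi(x, k), psi(y, k)))
```

Both values agree, but the left-hand side needs about $n$ multiplications while the embedded vectors have $n^k = 27$ coordinates each, which is the computational point of the kernel trick.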


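Let us augment $\mathbf{x}$ as $\mathbf{x}' = (1; \mathbf{x})$ to handle the constant terms of polynomials. By setting $\mathbf{x}' = (1; \mathbf{x})$ and $\mathbf{y}' = (1; \mathbf{y})$, we can obtain

$$(1 + \mathbf{x}^\top \mathbf{y})^k = (\mathbf{x}'^\top \mathbf{y}')^k = \langle \psi_k(\mathbf{x}'), \psi_k(\mathbf{y}') \rangle$$

just as in the above case, with the embedding $\psi_k$ now applied to vectors in $\mathbb{R}^{n+1}$.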


Gaussian Kernel 


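One of the most widely used kernels is the Gaussian kernel. It is defined as $K(x, y) = e^{-\frac{1}{2}(x - y)^2}$ in $\mathbb{R}^1$. If we expand this kernel, we get

$$\begin{aligned}
e^{-\frac{1}{2}(x - y)^2} &= e^{-\frac{1}{2}x^2}\, e^{-\frac{1}{2}y^2}\, e^{xy} \\
&= e^{-\frac{1}{2}x^2 - \frac{1}{2}y^2} \sum_{k=0}^{\infty} \frac{1}{k!} (xy)^k \\
&= \sum_{k=0}^{\infty} \Big( e^{-\frac{1}{2}x^2} \frac{x^k}{\sqrt{k!}} \Big) \Big( e^{-\frac{1}{2}y^2} \frac{y^k}{\sqrt{k!}} \Big).
\end{aligned}$$

We interpret the last infinite sum as an inner product in $\mathbb{R}^\infty$ by introducing the following embedding $\psi_{G1} : \mathbb{R} \to \mathbb{R}^\infty$:

$$\psi_{G1}(x) = e^{-\frac{1}{2}x^2} \Big( 1,\ x,\ \frac{x^2}{\sqrt{2!}},\ \ldots,\ \frac{x^k}{\sqrt{k!}},\ \ldots \Big)^\top \in \mathbb{R}^\infty. \tag{7.8}$$

This leads to the following simplified representation:

$$K(x, y) = \langle \psi_{G1}(x), \psi_{G1}(y) \rangle.$$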
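Because the embedding (7.8) is infinite-dimensional, it can only be evaluated approximately, by truncation. The sketch below keeps the first 20 coordinates and compares the resulting inner product with the closed-form kernel; the function names, the truncation length, and the sample inputs are assumptions made for illustration.

```python
import math

def psi_g1(x, terms):
    """First `terms` coordinates of the embedding (7.8): e^{-x^2/2} x^k / sqrt(k!)."""
    scale = math.exp(-0.5 * x * x)
    return [scale * x ** k / math.sqrt(math.factorial(k)) for k in range(terms)]

def gaussian_kernel(x, y):
    """One-dimensional Gaussian kernel exp(-(x - y)^2 / 2)."""
    return math.exp(-0.5 * (x - y) ** 2)

x, y = 0.8, -0.3
approx = sum(a * b for a, b in zip(psi_g1(x, 20), psi_g1(y, 20)))
exact = gaussian_kernel(x, y)
```

For inputs of moderate size the series for $e^{xy}$ converges quickly, so a short truncation already matches the exact kernel value to many digits; larger $|xy|$ would need more terms.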


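Let us generalize the inner product representation to the Gaussian kernel in $\mathbb{R}^n$. For two vectors $\mathbf{x}, \mathbf{y} \in \mathbb{R}^n$, we derive this kernel, $K(\mathbf{x}, \mathbf{y}) = e^{-\frac{1}{2}|\mathbf{x} - \mathbf{y}|^2}$, using the embedding function (7.7), following

$$\begin{aligned}
e^{-\frac{1}{2}|\mathbf{x} - \mathbf{y}|^2} &= e^{-\frac{1}{2}|\mathbf{x}|^2}\, e^{-\frac{1}{2}|\mathbf{y}|^2}\, e^{\mathbf{x}^\top \mathbf{y}} \\
&= e^{-\frac{1}{2}|\mathbf{x}|^2 - \frac{1}{2}|\mathbf{y}|^2} \sum_{k=0}^{\infty} \frac{1}{k!} (\mathbf{x}^\top \mathbf{y})^k \\
&= e^{-\frac{1}{2}|\mathbf{x}|^2}\, e^{-\frac{1}{2}|\mathbf{y}|^2} \sum_{k=0}^{\infty} \frac{1}{k!} \langle \psi_k(\mathbf{x}), \psi_k(\mathbf{y}) \rangle.
\end{aligned}$$

We rewrite the sum of the inner products as an inner product in $\mathbb{R}^\infty$ by defining the generalized embedding $\psi_{Gn} : \mathbb{R}^n \to \mathbb{R}^\infty$ as

$$\psi_{Gn}(\mathbf{x}) = e^{-\frac{1}{2}|\mathbf{x}|^2} \Big( 1;\ \psi_1(\mathbf{x});\ \frac{1}{\sqrt{2!}} \psi_2(\mathbf{x});\ \ldots;\ \frac{1}{\sqrt{k!}} \psi_k(\mathbf{x});\ \ldots \Big) \in \mathbb{R}^\infty.$$

With this embedding⁴ we see that we can write the Gaussian kernel as an inner product in the embedding space:

$$K(\mathbf{x}, \mathbf{y}) = \langle \psi_{Gn}(\mathbf{x}), \psi_{Gn}(\mathbf{y}) \rangle.$$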


Kernel tricks have been extensively studied both theoretically and practically and are widely used in 


practice. We suggest you refer to other materials for further discussion. 


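⁴ Already with only the leading four terms of $\psi_{Gn}$, we notice the explosion of the embedding dimension:

$$\Big( 1;\ \psi_1(\mathbf{x});\ \frac{1}{\sqrt{2!}} \psi_2(\mathbf{x});\ \frac{1}{\sqrt{3!}} \psi_3(\mathbf{x}) \Big) \in \mathbb{R}^{1 + n + n^2 + n^3}.$$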




Chapter 8 


Determinants 


Given $n$ vectors in $\mathbb{R}^n$, the volume of the parallelepiped with the $n$ vectors as its edges is an important quantity in many scientific fields. Even though volume seems intrinsic to Euclidean space, we need to agree on what a volume in $\mathbb{R}^n$ means for $n \ge 4$. A starting point is the volume of the unit cube $\{\mathbf{x} \in \mathbb{R}^n : 0 \le x_i \le 1 \text{ for all } i = 1, \ldots, n\}$ in $\mathbb{R}^n$. We set the volume of the unit cube in $\mathbb{R}^n$ to 1 regardless of the dimension. It also makes sense that doubling the length of an edge of a parallelepiped doubles its volume, and that adding the volumes of two parallelepipeds sharing $n - 1$ edges gives the volume of the single parallelepiped built by adding the two unmatched edge vectors and keeping the other $n - 1$ edges. In addition, swapping coordinates does not change the volumes of objects in $\mathbb{R}^n$. These properties encapsulate the notion of volume in high-dimensional Euclidean spaces.

Mathematicians invented a function called the determinant, denoted by det, that obeys these rules on square matrices. If we build an $n \times n$ matrix $A$ by stacking $n$ vectors in $\mathbb{R}^n$ row-wise, $\det(A)$ is the signed volume of the parallelepiped with the $n$ rows of $A$ as its edges. $\det(I_n) = 1$ since the identity matrix corresponds to the unit cube. $\det(A)$ is linear in each row of $A$, in other words, multilinear in the rows of $A$. In addition, $\det(A)$ changes its sign, but not its absolute value, if we swap two rows. In fact, there is only one function satisfying the above three properties, which we will show soon. Therefore, we may take $|\det(A)|$ as the volume of the parallelepiped built with the $n$ rows as its edges.


The det function further satisfies the following properties:
• $\det(A) = 0$ if and only if $A$ is singular;
• $\det(A^\top) = \det(A)$;
• $\det(A^{-1}) = \det(A)^{-1}$ for invertible $A$.
It can be written as a function of all entries of the matrix, which is referred to as Laplace expansion, just like $\det \begin{pmatrix} a & b \\ c & d \end{pmatrix} = ad - bc$ for $2 \times 2$ matrices.


Kang and Cho, Linear Algebra for Data Science, 149 
©2024. (Wanmo Kang, Kyunghyun Cho) all rights reserved. 




Finally, it is useful to relate determinants of various matrices in practice. For instance, it is useful in optimization to know the relationship between the determinants of an inverse matrix before and after adding a rank-one matrix to the original matrix, which is given by the Sherman-Morrison formula. We can also use Cramer's rule to express a solution to a linear system using the determinant of a coefficient matrix, although it is not computationally efficient in practice.


8.1 Definition and Properties 


We start by defining the determinant more formally. 


Definition 8.1 We call a function det on $n \times n$ square matrices a determinant when it satisfies the following properties:

1. $\det(I_n) = 1$;

2. The sign of det flips if a pair of rows in a matrix are swapped. That is, $\det(A) = -\det(\hat{A})$ where $\hat{A}$ results from swapping two rows of $A$.

3. It is linear with respect to each row of $A$. When a row of $A$ can be expressed as $\mathbf{b} + \alpha \mathbf{c}$, let us construct two matrices, $B$ and $C$, from $A$ by replacing this particular row with $\mathbf{b}$ and $\mathbf{c}$, respectively. Then, $\det(A) = \det(B) + \alpha \det(C)$.ᵃ

ᵃ We can get this result by combining the second property with linearity of the determinant in the first row of the matrix. If we want a minimal definition, we can thus change this third property to be about the first row of a matrix.


Starting from these three properties, we can derive many other properties of the determinant. 
Fact 8.1 If two rows of a square matrix A are equal, then det(A) = 0. 


Proof: Because swapping these two rows does not change the matrix itself, $\det(A) = -\det(A)$, which implies that $\det(A) = 0$. ∎
Fact 8.2 Adding a multiple of one row to another row leaves the same determinant. 


Proof: Let $\mathbf{b}$ and $\mathbf{c}$ be the $i$-th and $j$-th rows of a square matrix $B$, respectively, with $i \neq j$. We construct two matrices from $B$: $C$ by replacing the $i$-th row of $B$ with $\mathbf{c}$, and $A$ by replacing the $i$-th row of $B$ with $\mathbf{b} + \alpha \mathbf{c}$. Because the $i$-th and $j$-th rows of $C$ are the same, $\det(C) = 0$ according to Fact 8.1, and according to the third property in Definition 8.1, $\det(A) = \det(B) + \alpha \det(C) = \det(B)$. ∎


Fact 8.3 If A has a row of zeros, then det(A) = 0. 




Proof: We set $\alpha = -1$ and $\mathbf{b} = \mathbf{c}$ in the third property from Definition 8.1, so that the row $\mathbf{b} + \alpha \mathbf{c}$ is zero. Then, because $B = C$, $\det(A) = \det(B) - \det(C) = 0$. ∎


Fact 8.4 If A is triangular, then det(A) is the product of the diagonal entries. 


Proof: Assume $A$ is an upper-triangular matrix. If $a_{nn} = 0$, the $n$-th row of $A$ is zero, and thus $\det(A) = 0$, proving the statement. On the other hand, consider the case where $a_{nn} \neq 0$. According to Fact 8.2, the determinant does not change even if we add a scalar multiple of the last row to another row in the matrix. Because only $a_{nn}$ is non-zero in the last row of an upper-triangular matrix, we can replace all the other elements in the last column of the matrix with 0, without altering its determinant. Let such a matrix be $A_{n-1}$. If we repeat this procedure with $a_{(n-1)(n-1)}$, we end up with a matrix whose last two rows and columns constitute a diagonal matrix, again while keeping the determinant the same. We can repeat this procedure until we end up with a diagonal matrix $D$ such that $\det(A) = \det(D)$. Together with the first and third properties from Definition 8.1, $\det(D)$ is the product of all diagonal entries, which proves this fact. ∎


If a diagonal matrix has a zero on its diagonal, its determinant is zero, according to Fact 8.4. More generally, the determinant of a non-invertible matrix is zero.

Fact 8.5 $A$ is invertible if and only if $\det(A) \neq 0$.


Proof: Consider the LU-decomposition of a matrix $A$. For an appropriate choice of a permutation matrix $P$, there exist a lower-triangular matrix $L$ and an upper-triangular matrix $U$ that allow us to obtain a row-echelon form of $A$, and this can be expressed as $LPA = U$. Because $L$ represents adding scalar multiples of rows to other rows, $\det(LPA) = \det(PA)$ according to Fact 8.2. Similarly, because $P$ represents swapping rows of $A$, $\det(PA) = \pm \det A$ according to the second property from Definition 8.1. Combining these two together, we get $\det A = \pm \det U$, as $\det(LPA) = \pm \det A = \det U$. A necessary and sufficient condition for a square matrix $A$ to be invertible, from the previous chapter, is that all diagonal entries of the row echelon form $U$ are non-zero pivot elements. This condition is equivalent to $\det(U) \neq 0$, according to Fact 8.4. ∎
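The argument in this proof also gives a practical way to compute determinants: reduce to triangular form, track the sign flips from row swaps, and multiply the diagonal. The sketch below is a minimal illustration using exact rational arithmetic; the function name `det_by_elimination` and the test matrices are assumptions made here.

```python
from fractions import Fraction

def det_by_elimination(A):
    """Determinant via row reduction: row additions keep it, each swap flips its sign."""
    U = [[Fraction(v) for v in row] for row in A]
    n, sign = len(U), 1
    for col in range(n):
        # Find a row at or below the diagonal with a non-zero entry in this column.
        pivot = next((r for r in range(col, n) if U[r][col] != 0), None)
        if pivot is None:
            return Fraction(0)            # no pivot in this column: A is singular
        if pivot != col:
            U[col], U[pivot] = U[pivot], U[col]
            sign = -sign                  # second property of Definition 8.1
        for r in range(col + 1, n):
            factor = U[r][col] / U[col][col]
            U[r] = [u - factor * p for u, p in zip(U[r], U[col])]   # Fact 8.2
    result = Fraction(sign)
    for i in range(n):
        result *= U[i][i]                 # Fact 8.4 applied to the triangular U
    return result

d_regular = det_by_elimination([[0, 2, 1], [1, 1, 0], [2, 0, 3]])
d_singular = det_by_elimination([[1, 2], [2, 4]])
```

The first matrix needs one row swap and has a non-zero determinant, so it is invertible by Fact 8.5; the second has proportional rows and yields 0.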


We now consider whether a function satisfying the three conditions in Definition 8.1 exists and, if so, whether it is unique. Let $f$ be a function that satisfies the second and third conditions in Definition 8.1. Since none of Facts 8.1, 8.2, and 8.3 used the first property in their derivations, they are applicable to $f$. Let us construct an $(n - 1) \times n$ matrix $A'$ by taking all the rows except for the first row from $A$. We also construct the following $n \times n$ matrix $A_j$ by keeping only the $j$-th entry of the first row while setting all the other elements of that row to 0, that is, $A_j = \begin{pmatrix} a_{1j} \mathbf{e}_j^\top \\ A' \end{pmatrix}$. Since the first row of $A$ is $a_{11} \mathbf{e}_1^\top + a_{12} \mathbf{e}_2^\top + \cdots + a_{1n} \mathbf{e}_n^\top$, $f(A) = f(A_1) + \cdots + f(A_n)$ according to the third condition in the definition of the determinant.


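We can repeat the same procedure for the second row by creating an $(n - 2) \times n$ matrix $A''$ by removing the first two rows from $A$. Then, we can construct $A_{jk} = \begin{pmatrix} a_{1j} \mathbf{e}_j^\top \\ a_{2k} \mathbf{e}_k^\top \\ A'' \end{pmatrix}$, and see that $f(A_j) = f(A_{j1}) + \cdots + f(A_{jn})$. When the sub-indices coincide, i.e., $A_{jj}$, we notice that $f(A_{jj}) = 0$, because

$$f(A_{jj}) = a_{1j} a_{2j}\, f \begin{pmatrix} \mathbf{e}_j^\top \\ \mathbf{e}_j^\top \\ A'' \end{pmatrix} = 0$$

by Fact 8.1.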


After repeating this procedure for all $n$ rows, $A_{j_1 j_2 \ldots j_n}$ is a matrix with its $k$-th row being $a_{k j_k} \mathbf{e}_{j_k}^\top$, and as we have seen above, if $j_k = j_\ell$ for two rows $k$ and $\ell$, $f(A_{j_1 j_2 \ldots j_n}) = 0$. After removing these obviously zero terms, we need to consider only those $A_{j_1 j_2 \ldots j_n}$ for which the $j_k$'s are all different. In those cases, we get $j_1, j_2, \ldots, j_n$ by re-ordering $1, 2, \ldots, n$, that is, as a permutation of $1, 2, \ldots, n$. If we use $\sigma(i)$ to denote the positive integer corresponding to the $i$-th position in a permutation, $A_{j_1 j_2 \ldots j_n} = A_{\sigma(1) \sigma(2) \ldots \sigma(n)}$. Then,

$$f(A) = \sum_{\sigma:\ \text{permutation of } \{1, \ldots, n\}} f(A_{\sigma(1) \sigma(2) \ldots \sigma(n)}).$$


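Let $P_\sigma = \begin{pmatrix} \mathbf{e}_{\sigma(1)}^\top \\ \vdots \\ \mathbf{e}_{\sigma(n)}^\top \end{pmatrix}$ be the permutation matrix corresponding to the permutation $\sigma$. Neither the order nor the number of row swaps needed to turn $P_\sigma$ into the identity matrix $I$ is unique. Although we do not prove it in this book, the parity of that number is known to be the same regardless of how the swaps are performed: it is always even or always odd. With this fact, we define a sign function on permutations that returns 1 for an even permutation and $-1$ for an odd permutation. Then,

$$\begin{aligned}
f(A_{\sigma(1) \sigma(2) \ldots \sigma(n)})
&= f \begin{pmatrix} a_{1\sigma(1)} \mathbf{e}_{\sigma(1)}^\top \\ a_{2\sigma(2)} \mathbf{e}_{\sigma(2)}^\top \\ \vdots \\ a_{n\sigma(n)} \mathbf{e}_{\sigma(n)}^\top \end{pmatrix}
= a_{1\sigma(1)} a_{2\sigma(2)} \cdots a_{n\sigma(n)}\, f \begin{pmatrix} \mathbf{e}_{\sigma(1)}^\top \\ \mathbf{e}_{\sigma(2)}^\top \\ \vdots \\ \mathbf{e}_{\sigma(n)}^\top \end{pmatrix} \\
&= a_{1\sigma(1)} a_{2\sigma(2)} \cdots a_{n\sigma(n)}\, f(P_\sigma)
= \operatorname{sign}(\sigma)\, a_{1\sigma(1)} a_{2\sigma(2)} \cdots a_{n\sigma(n)}\, f(I).
\end{aligned}$$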


This allows us to write $f(A)$ as

$$f(A) = \sum_{\sigma:\ \text{permutation of } \{1, \ldots, n\}} \operatorname{sign}(\sigma)\, a_{1\sigma(1)} a_{2\sigma(2)} \cdots a_{n\sigma(n)}\, f(I). \tag{8.1}$$

We arrived at this expression by using the second and third properties from Definition 8.1 only, and the determinant satisfies this equation as well. Conversely, any $f$ defined as in (8.1) satisfies the second and third properties of Definition 8.1.


Theorem 8.1 The function satisfying Definition 8.1 is unique. 




Proof: Because $f$ in (8.1) already satisfies the second and third conditions in Definition 8.1, $f$ is a determinant as long as $f(I) = 1$. That is, by defining $f(I) = 1$, there exists at least one function that satisfies all three conditions in Definition 8.1, to which we refer by det.
Let $g$ be a function that satisfies both the second and third conditions in Definition 8.1. We define $h$ for any $n \times n$ matrix $A$ by

$$h(A) = g(A) - \det(A)\, g(I).$$

Then, $h(I) = g(I) - \det(I)\, g(I) = 0$. Because both $g$ and det satisfy the second and third conditions, so does $h$. We can thus expand $h$ in the form of (8.1) as

$$h(A) = \sum_{\sigma:\ \text{permutation of } \{1, \ldots, n\}} \operatorname{sign}(\sigma)\, a_{1\sigma(1)} a_{2\sigma(2)} \cdots a_{n\sigma(n)}\, h(I).$$

In this form, $h(A) = 0$ for any $A$, because $h(I) = 0$. That is, $g(A) = \det(A)\, g(I)$ holds for all $A$. If $g(I) = 1$, then $g = \det$, implying that there is only one function that satisfies all three properties in Definition 8.1. ∎
In addition to the existence and uniqueness of the determinant, we also obtained the following expanded form from (8.1) in this proof:

$$\det A = \sum_{\sigma:\ \text{permutation of } \{1, \ldots, n\}} \operatorname{sign}(\sigma)\, a_{1\sigma(1)} a_{2\sigma(2)} \cdots a_{n\sigma(n)}. \tag{8.2}$$
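The expansion (8.2) translates almost literally into code. The sketch below is meant only for small matrices, since the sum has $n!$ terms. It assumes the standard fact that the parity of a permutation equals the parity of its number of inversions (the text defines the sign through row swaps instead); the function names and test matrices are illustrative.

```python
from itertools import permutations
from math import prod

def sign(perm):
    """+1 for an even permutation, -1 for an odd one, by counting inversions."""
    n = len(perm)
    inversions = sum(1 for i in range(n) for j in range(i + 1, n) if perm[i] > perm[j])
    return -1 if inversions % 2 else 1

def det_leibniz(A):
    """Determinant by the permutation expansion (8.2), with zero-based indices."""
    n = len(A)
    return sum(sign(s) * prod(A[i][s[i]] for i in range(n))
               for s in permutations(range(n)))

A = [[0, 2, 1], [1, 1, 0], [2, 0, 3]]
B = [[1, 2, 0], [0, 1, 4], [3, 0, 1]]
AB = [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)] for i in range(3)]

d2 = det_leibniz([[1, 2], [3, 4]])   # reduces to ad - bc for a 2 x 2 matrix
dA, dB, dAB = det_leibniz(A), det_leibniz(B), det_leibniz(AB)
```

On these integer matrices the results are exact, and `dAB` equals `dA * dB`, in line with the product rule for determinants.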


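As described in Appendix B, $\sigma^{-1}$ is also a permutation if $\sigma$ is a permutation. Furthermore, $\operatorname{sign}(\sigma) = \operatorname{sign}(\sigma^{-1})$: reversing a sequence of swaps that realizes $\sigma$ realizes $\sigma^{-1}$ with the same number of swaps. Applying these to the summand in (8.2), we get

$$\operatorname{sign}(\sigma) \prod_{i=1}^{n} a_{i\sigma(i)} = \operatorname{sign}(\sigma^{-1}) \prod_{j=1}^{n} a_{\sigma^{-1}(j) j}.$$

Combining this with the fact that $\{\sigma : \sigma \text{ is a permutation}\} = \{\sigma^{-1} : \sigma \text{ is a permutation}\}$, we get the following alternative expression:

$$\det A = \sum_{\sigma:\ \text{permutation of } \{1, \ldots, n\}} \operatorname{sign}(\sigma)\, a_{\sigma(1)1} a_{\sigma(2)2} \cdots a_{\sigma(n)n}. \tag{8.3}$$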
These expansions are useful later when, for instance, we compute the determinant of a small matrix or derive some theoretical results on matrices. Theorem 8.1 allows us to observe an important rule for computing the determinant of matrix products.

Fact 8.6 For two square matrices $A$ and $B$ of the same size, $\det(AB) = \det(A) \det(B)$.


Proof: Fix a size $n \times n$ for both $A$ and $B$. If $B$ is singular, then $AB$ is singular as well, so both sides are 0 and the equality holds. Otherwise, $\det B \neq 0$. Consider a real-valued function $p$, defined for any $n \times n$ square matrix $C$ as follows:
$$p(C) = \frac{\det(CB)}{\det(B)}.$$

1. If $C = I$, $p(I) = \frac{\det(B)}{\det(B)} = 1$;

2. Since each row of $CB$ is the product of the corresponding row of $C$ and the matrix $B$, swapping the $i$-th and $j$-th rows of $C$ also swaps the corresponding rows of $CB$. That is, the sign of $\det(CB)$ flips. Since the sign of $\det(B)$ is maintained, the sign of $p(C)$ flips;

3. When a row of $C$ is $b^\top + \alpha c^\top$, the corresponding row of $CB$ is $b^\top B + \alpha c^\top B$. Since the denominator does not change, $p$ is linear with respect to each row.

This new function $p$ satisfies all three conditions in Definition 8.1, implying that $p$ is the matrix determinant. In other words, $p(C) = \det(C)$ for all $C$. Thus, when $C = A$,
$$p(A) = \frac{\det(AB)}{\det(B)} = \det(A),$$
from which we see that the determinant of the product of two matrices is equal to the product of the determinants of the two matrices. ∎
If $A$ is invertible, $\det(A) \neq 0$ according to Fact 8.5. Using $AA^{-1} = I$ with Fact 8.6, we get
$$\det(A^{-1}) = \det(A)^{-1} = \frac{1}{\det(A)}.$$
The inverse and determinant commute, and so do the transpose and determinant.

Fact 8.7 For a square matrix $A$, $\det(A^\top) = \det(A)$.

Proof: Expanding $\det A^\top$ along (8.2) results in (8.3). ∎


Thanks to Fact 8.7, all conditions and properties of the determinant in terms of rows can be rephrased in terms of columns. For example, swapping two columns changes the sign of the determinant, and the determinant function is linear with respect to each column.


8.2 Formulas for the Determinant 


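Let us use (8.2) to compute the determinant of a small matrix. Consider first a $2 \times 2$ matrix $A = \begin{bmatrix} a & b \\ c & d \end{bmatrix}$. Because $(a, b) = (a, 0) + (0, b)$ and $(c, d) = (c, 0) + (0, d)$, we can compute the determinant of this matrix as follows:
$$\begin{aligned}
\det(A) &= \det\begin{bmatrix} a & 0 \\ c & d \end{bmatrix} + \det\begin{bmatrix} 0 & b \\ c & d \end{bmatrix} \\
&= \det\begin{bmatrix} a & 0 \\ c & 0 \end{bmatrix} + \det\begin{bmatrix} a & 0 \\ 0 & d \end{bmatrix} + \det\begin{bmatrix} 0 & b \\ c & 0 \end{bmatrix} + \det\begin{bmatrix} 0 & b \\ 0 & d \end{bmatrix} \\
&= ad - bc.
\end{aligned}$$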


Along this line, let us try to compute the determinant of the following $3 \times 3$ matrix:
$$A = \begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{bmatrix}.$$
This can be done as follows. Splitting the first row by linearity gives
$$\det(A) = \det\begin{bmatrix} a_{11} & 0 & 0 \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{bmatrix} + \det\begin{bmatrix} 0 & a_{12} & 0 \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{bmatrix} + \det\begin{bmatrix} 0 & 0 & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{bmatrix}.$$
Splitting the second row and then the third row of each term in the same way produces determinants of matrices with a single (possibly) non-zero entry in each row. Whenever two rows have their entry in the same column, the rows are linearly dependent and the determinant vanishes. Only the six terms with exactly one entry in each row and each column may remain non-zero:
$$\begin{aligned}
\det(A) &= a_{11}a_{22}a_{33} \det\begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} + a_{11}a_{23}a_{32} \det\begin{bmatrix} 1 & 0 & 0 \\ 0 & 0 & 1 \\ 0 & 1 & 0 \end{bmatrix} + a_{12}a_{21}a_{33} \det\begin{bmatrix} 0 & 1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 1 \end{bmatrix} \\
&\quad + a_{12}a_{23}a_{31} \det\begin{bmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \\ 1 & 0 & 0 \end{bmatrix} + a_{13}a_{21}a_{32} \det\begin{bmatrix} 0 & 0 & 1 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \end{bmatrix} + a_{13}a_{22}a_{31} \det\begin{bmatrix} 0 & 0 & 1 \\ 0 & 1 & 0 \\ 1 & 0 & 0 \end{bmatrix} \\
&= \sum_{\sigma:\ \text{permutation of } \{1,2,3\}} a_{1\sigma(1)} a_{2\sigma(2)} a_{3\sigma(3)} \det(P_\sigma) \\
&= a_{11}a_{22}a_{33} \times 1 + a_{11}a_{23}a_{32} \times (-1) + a_{12}a_{21}a_{33} \times (-1) \\
&\quad + a_{12}a_{23}a_{31} \times (-1)^2 + a_{13}a_{21}a_{32} \times (-1)^2 + a_{13}a_{22}a_{31} \times (-1) \\
&= a_{11}(a_{22}a_{33} - a_{23}a_{32}) - a_{12}(a_{21}a_{33} - a_{23}a_{31}) + a_{13}(a_{21}a_{32} - a_{22}a_{31}) \\
&= a_{11} \det\begin{bmatrix} a_{22} & a_{23} \\ a_{32} & a_{33} \end{bmatrix} - a_{12} \det\begin{bmatrix} a_{21} & a_{23} \\ a_{31} & a_{33} \end{bmatrix} + a_{13} \det\begin{bmatrix} a_{21} & a_{22} \\ a_{31} & a_{32} \end{bmatrix} \\
&= \sum_{j=1}^{n} (-1)^{1+j} a_{1j} \det(A_{1j}).
\end{aligned}$$

$A_{1j}$ above is the $(n-1) \times (n-1)$ submatrix of $A$ obtained by removing the first row and the $j$-th column.

The last line above generalizes this computation from the $3 \times 3$ matrix to an $n \times n$ matrix, though we will not discuss any explicit proof here. While taking this as correct, let us try to prove the following result, showing that this expansion can be done with an arbitrary row rather than the first one:
$$\sum_{j=1}^{n} (-1)^{1+j} a_{1j} \det(A_{1j}) = \sum_{j=1}^{n} (-1)^{i+j} a_{ij} \det(A_{ij}).$$
Similarly to $A_{1j}$, $A_{ij}$ is the $(n-1) \times (n-1)$ submatrix of $A$ obtained by removing the $i$-th row and the $j$-th column. If we use $\tilde{A}$ to denote the matrix resulting from swapping the first and $i$-th rows of $A$, then $\det(\tilde{A}) = -\det(A)$. We now let $\tilde{A}_{1j}$ be the $(n-1) \times (n-1)$ submatrix of $\tilde{A}$ obtained by removing the first row and the $j$-th column, which in fact corresponds to rearranging all rows of $A$ but the $i$-th one in the order $2, \ldots, (i-1), 1, (i+1), \ldots, n$ and removing the $j$-th column. The first row of $A$ is the $(i-1)$-th row of $\tilde{A}_{1j}$. If we swap this row $(i-2)$ times to move it to the top of $\tilde{A}_{1j}$, we end up with $A_{ij}$. As the sign of the determinant flips every time a pair of rows is swapped, $\det(\tilde{A}_{1j}) = (-1)^{i-2} \det(A_{ij})$. Since the first row of $\tilde{A}$ is the $i$-th row of $A$, this results in the desired expansion:
$$\det(A) = -\det(\tilde{A}) = \sum_{j=1}^{n} (-1)^{j} a_{ij} \det(\tilde{A}_{1j}) = \sum_{j=1}^{n} (-1)^{j} a_{ij} (-1)^{i-2} \det(A_{ij}) = \sum_{j=1}^{n} (-1)^{i+j} a_{ij} \det(A_{ij}).$$

Using the shorthand notation $C_{ij} = (-1)^{i+j} \det(A_{ij})$, to which we refer as a cofactor, we get the following cofactor expansion of the determinant for any $i$:
$$\det(A) = \sum_{j=1}^{n} (-1)^{i+j} a_{ij} \det(A_{ij}) = \sum_{j=1}^{n} a_{ij} C_{ij}.$$


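Fact 8.8 Let $A$ be an $n \times n$ square matrix. For any $i$,
$$\det(A) = \sum_{j=1}^{n} a_{ij} C_{ij} = \sum_{j=1}^{n} a_{ji} C_{ji}.$$

Proof: We already showed that $\det(A) = \sum_{j=1}^{n} a_{ij} C_{ij}$. If we apply this to $A^\top$, then $(-1)^{i+j} \det((A^\top)_{ij}) = (-1)^{i+j} \det((A_{ji})^\top) = (-1)^{i+j} \det(A_{ji}) = C_{ji}$. Therefore, by Fact 8.7,
$$\det(A) = \det(A^\top) = \sum_{j=1}^{n} (A^\top)_{ij} (-1)^{i+j} \det((A^\top)_{ij}) = \sum_{j=1}^{n} a_{ji} C_{ji}.$$
∎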
This tells us that cofactor expansion does not only work for an arbitrary row but also for an arbitrary 
column. 


Fact 8.9 $\sum_{j=1}^{n} a_{kj} C_{ij} = 0$ for $i \neq k$.

Proof: We construct $B$ from $A$ by replacing the $i$-th row with the $k$-th row. In other words, the $i$-th and $k$-th rows of $B$ are the same, meaning that $\det(B) = 0$. All the entries of $A$ and $B$ are the same except for the $i$-th row, and thus the cofactors of the $i$-th row also coincide with each other. That is, the cofactor of $B$ at position $(i, j)$ is also $C_{ij}$. Then, the cofactor expansion of $B$ with respect to the $i$-th row is
$$\det(B) = \sum_{j=1}^{n} b_{ij} C_{ij} = \sum_{j=1}^{n} a_{kj} C_{ij} = 0,$$
which proves the statement. ∎
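Facts 8.8 and 8.9 are easy to check numerically on a small integer matrix. The sketch below is only an illustration: the helpers `minor`, `det`, and `cofactor` are hypothetical names, indices are 0-based (which leaves the parity of $i + j$ unchanged), and the recursive determinant is exponential in $n$, so it is suitable for tiny matrices only.

```python
def minor(a, i, j):
    """Submatrix of a with row i and column j removed (0-based indices)."""
    return [row[:j] + row[j + 1:] for k, row in enumerate(a) if k != i]


def det(a):
    """Determinant by cofactor expansion along the first row; the empty matrix has determinant 1."""
    if not a:
        return 1
    return sum((-1) ** j * a[0][j] * det(minor(a, 0, j)) for j in range(len(a)))


def cofactor(a, i, j):
    """C_ij = (-1)^(i+j) det(A_ij)."""
    return (-1) ** (i + j) * det(minor(a, i, j))


A = [[2, 0, 1], [1, 3, -1], [4, 1, 5]]  # illustrative matrix
n = len(A)

# Expansion along every row i and along every column i (Fact 8.8).
row_expansions = [sum(A[i][j] * cofactor(A, i, j) for j in range(n)) for i in range(n)]
col_expansions = [sum(A[j][i] * cofactor(A, j, i) for j in range(n)) for i in range(n)]

# Entries of row k paired with cofactors of a different row i (Fact 8.9).
mismatched = [sum(A[k][j] * cofactor(A, i, j) for j in range(n))
              for i in range(n) for k in range(n) if i != k]

print(det(A), row_expansions, col_expansions, mismatched)
```

Every row and column expansion reproduces `det(A)`, while all mismatched sums are zero.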


Example 8.1 Let us compute the determinant of (the second difference matrix) 


For an appropriate elementary matrix L, A= LD = L Tete, , and therefore, det A = 




8.2.1 Determinant of a Block Matrix 


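Consider the determinant of the following $(n_1 + n_2) \times (n_1 + n_2)$ square block matrix consisting of $A_{11}$ and $A_{22}$ of sizes $n_1 \times n_1$ and $n_2 \times n_2$, respectively:
$$A = \begin{bmatrix} A_{11} & 0 \\ 0 & A_{22} \end{bmatrix}.$$
If both of these blocks were of size $1 \times 1$, as in $\begin{bmatrix} a_{11} & 0 \\ 0 & a_{22} \end{bmatrix}$, the determinant would be $a_{11} a_{22} = \det([a_{11}]) \det([a_{22}])$. From this observation, it is natural to conjecture
$$\det(A) = \det(A_{11}) \det(A_{22}).$$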


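Let $\sigma$ be a permutation of $\{1, \ldots, n_1 + n_2\}$, and $\sigma_1$ be a permutation where only the first $n_1$ indices are permuted and the rest $\{n_1 + 1, \ldots, n_1 + n_2\}$ are maintained (as an identity). On the other hand, $\sigma_2$ is a permutation that only permutes the latter $n_2$ indices $\{n_1 + 1, \ldots, n_1 + n_2\}$, while keeping the rest $\{1, \ldots, n_1\}$ as they are. When $P_\sigma$ is the permutation matrix of $\sigma$, $\operatorname{sign}(\sigma) = \det P_\sigma$. We make one important observation here. $\sigma_1 \circ \sigma_2$ is a permutation over $\{1, \ldots, n_1 + n_2\}$ in which the first $n_1$ and the latter $n_2$ indices are permuted only among themselves, implying that $\sigma_1 \circ \sigma_2 = \sigma_2 \circ \sigma_1$. Conversely, any permutation that keeps these two groups of indices separate must be expressible in the form $\sigma_1 \circ \sigma_2$. Furthermore, any permutation $\sigma$ that cannot be expressed as $\sigma_1 \circ \sigma_2$ must map either some $i \le n_1$ to $\sigma(i) > n_1$ or some $i > n_1$ to $\sigma(i) \le n_1$. Since such a pair $(i, \sigma(i))$ corresponds to an entry of $A$ outside $A_{11}$ and $A_{22}$, the block diagonal structure imposes $a_{i\sigma(i)} = 0$. We can hence limit ourselves to permutations of the form $\sigma_1 \circ \sigma_2$ to compute the determinant of $A$.

Let $S$ be the set of permutations of $\{1, \ldots, n_1 + n_2\}$, $S_1$ the set of permutations of $\{1, \ldots, n_1\}$, and $S_2$ the set of permutations of $\{n_1 + 1, \ldots, n_1 + n_2\}$. Then, we get the following expression for the determinant of $A$:
$$\begin{aligned}
\det(A) &= \sum_{\sigma \in S} a_{1\sigma(1)} \cdots a_{(n_1+n_2)\sigma(n_1+n_2)} \operatorname{sign}(\sigma) \\
&= \sum_{\sigma_1 \in S_1,\, \sigma_2 \in S_2} a_{1\,\sigma_1 \circ \sigma_2(1)} \cdots a_{(n_1+n_2)\,\sigma_1 \circ \sigma_2(n_1+n_2)} \operatorname{sign}(\sigma_1 \circ \sigma_2) \\
&= \sum_{\sigma_1 \in S_1} \sum_{\sigma_2 \in S_2} a_{1\sigma_1(1)} \cdots a_{n_1\sigma_1(n_1)}\, a_{(n_1+1)\sigma_2(n_1+1)} \cdots a_{(n_1+n_2)\sigma_2(n_1+n_2)} \operatorname{sign}(\sigma_1) \operatorname{sign}(\sigma_2) \\
&= \Big( \sum_{\sigma_1 \in S_1} a_{1\sigma_1(1)} \cdots a_{n_1\sigma_1(n_1)} \operatorname{sign}(\sigma_1) \Big) \Big( \sum_{\sigma_2 \in S_2} a_{(n_1+1)\sigma_2(n_1+1)} \cdots a_{(n_1+n_2)\sigma_2(n_1+n_2)} \operatorname{sign}(\sigma_2) \Big) \\
&= \det(A_{11}) \det(A_{22}).
\end{aligned}$$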
Next, consider the following block lower-triangular matrix:
$$A = \begin{bmatrix} A_{11} & 0 \\ A_{21} & A_{22} \end{bmatrix}.$$
Taking into account that the determinant of a regular (non-block) triangular matrix is the product of its diagonal entries, we can make an educated guess that the determinant of the block matrix must be
$$\det(A) = \det(A_{11}) \det(A_{22}).$$
If $\det(A_{11}) = 0$, then $\operatorname{rank}(A_{11}) < n_1$, implying that $\operatorname{rank}(A) < n_1 + n_2$ and $\det(A) = 0$. In this case, the expression above holds. When $A_{11}$ is invertible, we can perform Gaussian elimination on $A$ as follows:
$$\begin{bmatrix} I_{n_1} & 0 \\ -A_{21} A_{11}^{-1} & I_{n_2} \end{bmatrix} \begin{bmatrix} A_{11} & 0 \\ A_{21} & A_{22} \end{bmatrix} = \begin{bmatrix} A_{11} & 0 \\ 0 & A_{22} \end{bmatrix}.$$
Because the first matrix on the left-hand side is a lower-triangular matrix with a unit diagonal, its determinant is 1. We already showed earlier that the determinant of the right-hand side is $\det(A_{11}) \det(A_{22})$. Combining these two, we see that the equation above holds for the block lower-triangular matrix.

Finally, consider a general block matrix, where $A_{11}$ is invertible. We can perform Gaussian elimination to remove $A_{21}$ as follows:
$$\begin{bmatrix} A_{11}^{-1} & 0 \\ -A_{21} A_{11}^{-1} & I_{n_2} \end{bmatrix} \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix} = \begin{bmatrix} I_{n_1} & A_{11}^{-1} A_{12} \\ 0 & A_{22} - A_{21} A_{11}^{-1} A_{12} \end{bmatrix}.$$
If we consider the determinants of both sides of the equation, we get
$$\det(A_{11}^{-1}) \det(A) = \det(A_{22} - A_{21} A_{11}^{-1} A_{12}).$$
Since the determinant of the inverse is the inverse of the determinant of the original matrix, we arrive at
$$\det(A) = \det(A_{11}) \det(A_{22} - A_{21} A_{11}^{-1} A_{12}), \tag{8.4}$$
where $A_{22} - A_{21} A_{11}^{-1} A_{12}$ is the Schur complement.
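The block formulas can be checked with exact rational arithmetic. The following sketch assumes small illustrative blocks (all values are made up for the example) and uses hypothetical helpers `minor`, `det`, `inverse`, `matmul`, and `assemble`; the inverse is formed from cofactors, which is fine for tiny matrices but not a practical method.

```python
from fractions import Fraction


def minor(a, i, j):
    """Submatrix of a with row i and column j removed."""
    return [row[:j] + row[j + 1:] for k, row in enumerate(a) if k != i]


def det(a):
    """Determinant by cofactor expansion along the first row; the empty matrix has determinant 1."""
    if not a:
        return 1
    return sum((-1) ** j * a[0][j] * det(minor(a, 0, j)) for j in range(len(a)))


def inverse(a):
    """Exact inverse: transposed cofactor matrix divided by the determinant."""
    n, d = len(a), Fraction(det(a))
    return [[(-1) ** (i + j) * det(minor(a, j, i)) / d for j in range(n)] for i in range(n)]


def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def assemble(tl, tr, bl, br):
    """Build a block matrix from its four blocks."""
    return [r1 + r2 for r1, r2 in zip(tl, tr)] + [r1 + r2 for r1, r2 in zip(bl, br)]


A11 = [[2, 1], [1, 3]]
A12 = [[1, 0], [2, 1]]
A21 = [[0, 1], [1, 1]]
A22 = [[4, 1], [2, 5]]
zero = [[0, 0], [0, 0]]

A = assemble(A11, A12, A21, A22)
correction = matmul(matmul(A21, inverse(A11)), A12)
schur = [[x - y for x, y in zip(r1, r2)] for r1, r2 in zip(A22, correction)]

general_lhs, general_rhs = det(A), det(A11) * det(schur)      # Schur complement formula
lower = assemble(A11, zero, A21, A22)                          # block lower-triangular case
lower_lhs, lower_rhs = det(lower), det(A11) * det(A22)
print(general_lhs == general_rhs, lower_lhs == lower_rhs)
```

Using `Fraction` avoids floating-point round-off, so the two sides can be compared with exact equality.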


8.2.2 Matrix Determinant Lemma 


We derive the matrix determinant lemma from the results on the determinant of a block matrix above. From there, in Section 8.4, we go on to derive the Sherman-Morrison and Woodbury formulas, which explain how the inverse changes when we modify an invertible matrix by adding a product of lower-rank matrices.

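Although there is no intuitive way to explain why we choose these three matrices, let us consider the product of the following three matrices:
$$\begin{bmatrix} I_n & 0 \\ V^\top & I_k \end{bmatrix} \begin{bmatrix} I_n + UV^\top & U \\ 0 & I_k \end{bmatrix} \begin{bmatrix} I_n & 0 \\ -V^\top & I_k \end{bmatrix},$$
where $U$ and $V$ are both $n \times k$ matrices. Because the product of the first two matrices is
$$\begin{bmatrix} I_n + UV^\top & U \\ V^\top + V^\top U V^\top & I_k + V^\top U \end{bmatrix},$$
the product of all three matrices is
$$\begin{bmatrix} I_n + UV^\top & U \\ V^\top + V^\top U V^\top & I_k + V^\top U \end{bmatrix} \begin{bmatrix} I_n & 0 \\ -V^\top & I_k \end{bmatrix} = \begin{bmatrix} I_n & U \\ 0 & I_k + V^\top U \end{bmatrix}.$$
As the determinants of the original expression and the final expression must coincide, and the first and third factors are triangular with unit diagonals, we get
$$\det(I_n + UV^\top) = \det(I_k + V^\top U), \tag{8.5}$$
from which we derive the following general result.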


Theorem 8.2 (Matrix Determinant Lemma) Let $A$ be an $n \times n$ invertible matrix, and $U$ and $V$ be $n \times k$ matrices. Then,
$$\det(A + UV^\top) = \det(A) \det(I_k + V^\top A^{-1} U). \tag{8.6}$$

Proof: Since $A + UV^\top = A(I_n + A^{-1} U V^\top)$, we get
$$\begin{aligned}
\det(A + UV^\top) &= \det\big(A(I_n + A^{-1} U V^\top)\big) \\
&= \det(A) \det(I_n + A^{-1} U V^\top) \\
&= \det(A) \det(I_k + V^\top A^{-1} U) \quad \text{by plugging } A^{-1}U \text{ into } U \text{ in (8.5)}.
\end{aligned}$$
∎

In addition to using this result for computing the determinant, it is also often used to show that the inverse of $A + UV^\top$ exists only when $I_k + V^\top A^{-1} U$ is invertible. When $k = 1$, $U$ and $V$ are vectors $u$ and $v$ in $\mathbb{R}^n$, respectively, and the matrix determinant formula simplifies to
$$\det(A + uv^\top) = \det(A) \det(1 + v^\top A^{-1} u). \tag{8.7}$$
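As a sanity check, the matrix determinant lemma can be verified exactly for a small case with $n = 3$ and $k = 2$. This is a sketch under stated assumptions: the matrices `A`, `U`, `V` are arbitrary illustrative values, and the helpers `minor`, `det`, `inverse`, and `matmul` are hypothetical names for cofactor-based routines that are only practical for tiny matrices.

```python
from fractions import Fraction


def minor(a, i, j):
    """Submatrix of a with row i and column j removed."""
    return [row[:j] + row[j + 1:] for k, row in enumerate(a) if k != i]


def det(a):
    """Determinant by cofactor expansion along the first row; the empty matrix has determinant 1."""
    if not a:
        return 1
    return sum((-1) ** j * a[0][j] * det(minor(a, 0, j)) for j in range(len(a)))


def inverse(a):
    """Exact inverse: transposed cofactor matrix divided by the determinant."""
    n, d = len(a), Fraction(det(a))
    return [[(-1) ** (i + j) * det(minor(a, j, i)) / d for j in range(n)] for i in range(n)]


def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


A = [[2, 0, 1], [1, 3, 0], [0, 1, 4]]   # invertible n x n matrix (n = 3)
U = [[1, 0], [2, 1], [0, 1]]            # n x k with k = 2
V = [[1, 1], [0, 2], [1, 0]]            # n x k
k = 2

Vt = [list(col) for col in zip(*V)]
UVt = matmul(U, Vt)
A_plus = [[x + y for x, y in zip(r1, r2)] for r1, r2 in zip(A, UVt)]

small = matmul(matmul(Vt, inverse(A)), U)  # k x k matrix V^T A^{-1} U
I_plus = [[small[i][j] + (1 if i == j else 0) for j in range(k)] for i in range(k)]

lhs = det(A_plus)                # n x n determinant
rhs = det(A) * det(I_plus)       # n x n determinant of A times a k x k determinant
print(lhs, rhs)
```

The practical appeal is that, once $\det(A)$ and $A^{-1}$ are known, only a $k \times k$ determinant has to be evaluated for each new low-rank modification.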


8.3 Applications of Determinant 


8.3.1 The Volume of a Parallelopiped in $\mathbb{R}^n$


Consider $n$ vectors $a_1, \ldots, a_n$ in $\mathbb{R}^n$, and let us compute the volume $V$ of the $n$-dimensional parallelopiped defined by the origin and these vectors. We assume these vectors are linearly independent, as otherwise the volume vanishes.

• Assume $a_1, \ldots, a_n$ are orthogonal, and let $A = [a_1 \,|\, \cdots \,|\, a_n]$. Since these vectors are orthogonal, the volume $V$ corresponds to the product of their magnitudes/norms. That is, $V = \prod_{i=1}^{n} |a_i|$. By Facts 8.6 and 8.7, we know that
$$\det(A)^2 = \det(A^\top A) = \prod_{i=1}^{n} |a_i|^2 = V^2, \quad \text{that is,} \quad V = |\det(A)|,$$
because
$$A^\top A = \begin{bmatrix} |a_1|^2 & 0 & \cdots & 0 & 0 \\ 0 & |a_2|^2 & \cdots & 0 & 0 \\ \vdots & & \ddots & & \vdots \\ 0 & 0 & \cdots & |a_{n-1}|^2 & 0 \\ 0 & 0 & \cdots & 0 & |a_n|^2 \end{bmatrix}.$$

• We now relax the constraint on $a_1, \ldots, a_n$ so that they are linearly independent but not necessarily orthogonal. We use the QR decomposition to rewrite $A$ as $QR$, where $Q$ is an orthogonal matrix and $R$ is an upper-triangular matrix. Once we have computed the volume of the $(i-1)$-dimensional parallelopiped defined by $\{a_1, \ldots, a_{i-1}\}$ together with the origin in the $(i-1)$-dimensional subspace $\operatorname{span}\{q_1, \ldots, q_{i-1}\}$, the contribution of $a_i$ to the volume of the $i$-dimensional parallelopiped comes from its component orthogonal to $\operatorname{span}\{q_1, \ldots, q_{i-1}\}$. Since the $i$-th column of $R$ holds the coordinates of $a_i$ under the basis $\{q_1, \ldots, q_i\}$, the absolute value of $R_{ii} = \langle a_i, q_i \rangle$ is precisely the contribution of $a_i$ to the parallelopiped's volume in the $i$-dimensional subspace. Thus, according to Fact 8.4, we know that
$$V = \Big| \prod_{i=1}^{n} R_{ii} \Big| = |\det(R)|.$$
Since $A^\top A = R^\top Q^\top Q R = R^\top R$, $\det(A)^2 = \det(R)^2$. Therefore, $V = |\det(A)|$.


8.3.2 Computing $A^{-1}$


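One observation we draw by considering a $2 \times 2$ matrix $A = \begin{bmatrix} a & b \\ c & d \end{bmatrix}$ is that
$$A^{-1} = \begin{bmatrix} a & b \\ c & d \end{bmatrix}^{-1} = \frac{1}{ad - bc} \begin{bmatrix} d & -b \\ -c & a \end{bmatrix} = \frac{1}{\det(A)} \begin{bmatrix} C_{11} & C_{21} \\ C_{12} & C_{22} \end{bmatrix},$$
where $C_{ij}$ is a cofactor. If we multiply both sides with $A$,
$$I = A\, \frac{1}{\det(A)} \begin{bmatrix} C_{11} & C_{21} \\ C_{12} & C_{22} \end{bmatrix} = \frac{1}{\det(A)} \begin{bmatrix} a & b \\ c & d \end{bmatrix} \begin{bmatrix} C_{11} & C_{21} \\ C_{12} & C_{22} \end{bmatrix}.$$
Let $C = (C_{ij})$ be the cofactor matrix. Then, we see that
$$A^{-1} = \frac{1}{\det(A)} C^\top$$
for a $2 \times 2$ matrix $A$. We generalize this result to an arbitrary invertible square matrix by using the cofactor expansion with respect to an arbitrary row $i$,
$$\det(A) = \sum_{j=1}^{n} a_{ij} C_{ij} = [a_{i1}, \ldots, a_{in}]\, [C_{i1}, \ldots, C_{in}]^\top,$$
and Fact 8.9, which results in
$$AC^\top = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{n1} & a_{n2} & \cdots & a_{nn} \end{bmatrix} \begin{bmatrix} C_{11} & C_{21} & \cdots & C_{n1} \\ C_{12} & C_{22} & \cdots & C_{n2} \\ \vdots & \vdots & \ddots & \vdots \\ C_{1n} & C_{2n} & \cdots & C_{nn} \end{bmatrix} = \begin{bmatrix} \det(A) & 0 & \cdots & 0 \\ 0 & \det(A) & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \det(A) \end{bmatrix} = \det(A)\, I.$$
In short,
$$A^{-1} = \frac{1}{\det(A)} C^\top. \tag{8.8}$$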
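The relation between a matrix, its cofactor matrix $C$, and its inverse (the product $AC^\top$ equals $\det(A)$ times the identity, so $C^\top / \det(A)$ is the inverse) can be checked on a concrete matrix. The sketch below uses an illustrative $3 \times 3$ integer matrix and hypothetical helper names; it forms every cofactor explicitly, which is far more expensive than Gaussian elimination for anything but tiny matrices.

```python
from fractions import Fraction


def minor(a, i, j):
    """Submatrix of a with row i and column j removed."""
    return [row[:j] + row[j + 1:] for k, row in enumerate(a) if k != i]


def det(a):
    """Determinant by cofactor expansion along the first row; the empty matrix has determinant 1."""
    if not a:
        return 1
    return sum((-1) ** j * a[0][j] * det(minor(a, 0, j)) for j in range(len(a)))


def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


A = [[2, 0, 1], [1, 3, -1], [4, 1, 5]]  # illustrative invertible matrix
n = len(A)

C = [[(-1) ** (i + j) * det(minor(A, i, j)) for j in range(n)] for i in range(n)]  # cofactor matrix
Ct = [list(col) for col in zip(*C)]                                                 # its transpose

product = matmul(A, Ct)                                        # det(A) on the diagonal, 0 elsewhere
A_inv = [[Fraction(x, det(A)) for x in row] for row in Ct]     # inverse as C^T / det(A)
identity = matmul(A, A_inv)
print(product, identity)
```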


8.3.3 Cramer’s Rule: Solution of Ax = b 


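When $A$ is an invertible matrix, we can use (8.8) to derive the solution to $Ax = b$ as follows:
$$x = A^{-1} b = \frac{1}{\det(A)} C^\top b, \quad \text{that is,} \quad x_j = \frac{1}{\det(A)} \sum_{i=1}^{n} C_{ij} b_i.$$
If we replace the $j$-th column of $A$ with $b$, the determinant of the resulting matrix, expanded along that column, is $\sum_{i=1}^{n} C_{ij} b_i$. This allows us to derive the following Cramer's rule:
$$x_j = \frac{1}{\det(A)} \det\big[\, a_1 \,|\, \cdots \,|\, a_{j-1} \,|\, b \,|\, a_{j+1} \,|\, \cdots \,|\, a_n \big].$$
Cramer's rule is concise and useful for mathematical reasoning, but it is not computationally efficient enough for solving real-world linear systems.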
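A direct transcription of Cramer's rule makes both its simplicity and its cost visible: each unknown requires one more determinant. The sketch below is illustrative only; `cramer`, `minor`, and `det` are hypothetical helper names, and the system is chosen so that the solution is easy to recognize.

```python
from fractions import Fraction


def minor(a, i, j):
    """Submatrix of a with row i and column j removed."""
    return [row[:j] + row[j + 1:] for k, row in enumerate(a) if k != i]


def det(a):
    """Determinant by cofactor expansion along the first row; the empty matrix has determinant 1."""
    if not a:
        return 1
    return sum((-1) ** j * a[0][j] * det(minor(a, 0, j)) for j in range(len(a)))


def cramer(a, b):
    """Solve a x = b for an invertible integer matrix a by replacing one column at a time with b."""
    d = det(a)
    if d == 0:
        raise ValueError("matrix is singular; Cramer's rule does not apply")
    solution = []
    for j in range(len(a)):
        replaced = [row[:j] + [b[i]] + row[j + 1:] for i, row in enumerate(a)]
        solution.append(Fraction(det(replaced), d))
    return solution


A = [[2, 0, 1], [1, 3, -1], [4, 1, 5]]
b = [3, 3, 10]          # equals A applied to (1, 1, 1)
x = cramer(A, b)
residual = [sum(A[i][j] * x[j] for j in range(3)) - b[i] for i in range(3)]
print(x, residual)
```

With the recursive determinant used here the cost grows factorially in $n$, which illustrates why elimination-based solvers are preferred in practice.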


8.4 Sherman-Morrison and Woodbury Formulas 


Consider the inverse of $A + UV^\top$, where $A$ is an $n \times n$ invertible matrix, and $U$ and $V$ are $n \times k$ matrices. We want to express this inverse using the inverse of $A$, which is assumed to be known. According to the matrix determinant lemma (8.6), the inverse of $A + UV^\top$ exists only when $I_k + V^\top A^{-1} U$ is invertible.

Let us start with the inverse of $I_n + UV^\top$. Note that $I_n + UV^\top$ is invertible if and only if $I_k + V^\top U$ is invertible. Try $I_n + UBV^\top$ as the inverse with an appropriate choice of $B$ (though it is definitely not intuitive how we decide to start from here). Then,
$$(I_n + UV^\top)(I_n + UBV^\top) = I_n + UV^\top + UBV^\top + UV^\top UBV^\top = I_n + U(I_k + B + V^\top U B)V^\top,$$
which tells us that a sufficient condition for $I_n + UBV^\top$ to be the inverse of $I_n + UV^\top$ is $I_k + B + V^\top U B = 0$. Then, from
$$-I_k = B + V^\top U B = (I_k + V^\top U) B$$
we get
$$B = -(I_k + V^\top U)^{-1}.$$
According to the matrix determinant lemma, the invertibility of $I_k + V^\top U$ is the necessary and sufficient condition for the invertibility of $I_n + UV^\top$. That is, $I_k + V^\top U$ is invertible, and by plugging this $B$ into the original form $I_n + UBV^\top$, we get
$$(I_n + UV^\top)^{-1} = I_n - U(I_k + V^\top U)^{-1} V^\top. \tag{8.9}$$
Now, let us derive the inverse of $A + UV^\top$:
$$\begin{aligned}
(A + UV^\top)^{-1} &= \big(A(I_n + A^{-1} U V^\top)\big)^{-1} \\
&= (I_n + A^{-1} U V^\top)^{-1} A^{-1} \\
&= \big(I_n - A^{-1} U (I_k + V^\top A^{-1} U)^{-1} V^\top\big) A^{-1} \\
&= A^{-1} - A^{-1} U (I_k + V^\top A^{-1} U)^{-1} V^\top A^{-1}.
\end{aligned}$$
According to this rule, we can compute the inverse of an $n \times n$ matrix by only computing the inverse of a smaller $k \times k$ matrix, if we already know $A^{-1}$.


Theorem 8.3 (Woodbury Formula) Let $A$ be an $n \times n$ invertible matrix, and $U$ and $V$ be $n \times k$ matrices. Then, $I_k + V^\top A^{-1} U$ is invertible if and only if $A + UV^\top$ is invertible. In that case,
$$(A + UV^\top)^{-1} = A^{-1} - A^{-1} U (I_k + V^\top A^{-1} U)^{-1} V^\top A^{-1}.$$

Proof: It is clear from the matrix determinant lemma (8.6) that these two conditions are necessary and sufficient conditions of each other. We already derived the inverse above. ∎

This theorem states that the invertibility of $I_k + V^\top A^{-1} U$ is a sufficient condition for the invertibility of $A + UV^\top$.

It is important to consider the case of $k = 1$. In that case, $U$ and $V$ are vectors $u$ and $v$ in $\mathbb{R}^n$, respectively, and the necessary and sufficient condition for the invertibility of $A + uv^\top$ is $1 + v^\top A^{-1} u \neq 0$.

Corollary 8.1 (Sherman-Morrison Formula) Let $A$ be an $n \times n$ invertible matrix, and $u$ and $v$ be vectors in $\mathbb{R}^n$. Then, $1 + v^\top A^{-1} u \neq 0$ if and only if $A + uv^\top$ is invertible. In that case,
$$(A + uv^\top)^{-1} = A^{-1} - \frac{A^{-1} u v^\top A^{-1}}{1 + v^\top A^{-1} u}. \tag{8.11}$$
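The rank-one case is simple enough to verify exactly. The following sketch assumes an illustrative matrix `A` and vectors `u`, `v` (made-up values), obtains $A^{-1}$ from cofactors via hypothetical helpers, applies the rank-one correction, and confirms that the result inverts $A + uv^\top$.

```python
from fractions import Fraction


def minor(a, i, j):
    """Submatrix of a with row i and column j removed."""
    return [row[:j] + row[j + 1:] for k, row in enumerate(a) if k != i]


def det(a):
    """Determinant by cofactor expansion along the first row; the empty matrix has determinant 1."""
    if not a:
        return 1
    return sum((-1) ** j * a[0][j] * det(minor(a, 0, j)) for j in range(len(a)))


def inverse(a):
    """Exact inverse: transposed cofactor matrix divided by the determinant."""
    n, d = len(a), Fraction(det(a))
    return [[(-1) ** (i + j) * det(minor(a, j, i)) / d for j in range(n)] for i in range(n)]


def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


A = [[2, 0, 1], [1, 3, -1], [4, 1, 5]]
u = [1, 0, 2]
v = [0, 1, 1]
n = len(A)

A_inv = inverse(A)
Ainv_u = [sum(A_inv[i][j] * u[j] for j in range(n)) for i in range(n)]    # A^{-1} u
vt_Ainv = [sum(v[i] * A_inv[i][j] for i in range(n)) for j in range(n)]   # v^T A^{-1}
denom = 1 + sum(v[i] * Ainv_u[i] for i in range(n))                       # 1 + v^T A^{-1} u

# Rank-one correction of the known inverse.
updated_inv = [[A_inv[i][j] - Ainv_u[i] * vt_Ainv[j] / denom for j in range(n)] for i in range(n)]

A_plus = [[A[i][j] + u[i] * v[j] for j in range(n)] for i in range(n)]    # A + u v^T
check = matmul(A_plus, updated_inv)
print(denom, check)
```

Given $A^{-1}$, the update costs only matrix-vector products and one outer product, instead of a fresh inversion of the modified matrix.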




8.5 An Application to Optimization: Rank-One Update of Inverse Hessian


In machine learning, the process of learning is often implemented as optimization. That is, it is the 
process of iteratively finding a set of parameters that minimizes or maximizes the learning objective 
function. Let us use $w \in \mathbb{R}^n$ to denote the parameter vector. A representative example is the parameters
of a neural network from Section 3.10. We use $f(w)$ to denote the learning objective function, which is
often referred to as a loss function. We prefer it to be smaller, which means that learning corresponds to
solving $w^* = \arg\min_w f(w)$. Among numerous algorithms that have been proposed to solve this problem,
we consider a symmetric rank-one update based algorithm. 

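Most nonlinear optimization algorithms aim to find $w^*$ that makes the gradient vanish, i.e., $\nabla f(w) = 0$. In doing so, consider the first-order derivative (the gradient vector $\nabla f$) and the second-order derivative (the Hessian matrix $\nabla^2 f$):
$$\nabla f = \begin{bmatrix} \frac{\partial f}{\partial w_1} \\ \vdots \\ \frac{\partial f}{\partial w_n} \end{bmatrix}, \qquad H = \nabla^2 f = \begin{bmatrix} \frac{\partial^2 f}{\partial w_1 \partial w_1} & \cdots & \frac{\partial^2 f}{\partial w_1 \partial w_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial^2 f}{\partial w_n \partial w_1} & \cdots & \frac{\partial^2 f}{\partial w_n \partial w_n} \end{bmatrix}.$$
The second-order Taylor approximation to $f$ near $w_k$ is given as
$$f(w) \approx f(w_k) + \nabla f(w_k)^\top (w - w_k) + \frac{1}{2} (w - w_k)^\top H(w_k) (w - w_k).$$
Because $\frac{\partial^2}{\partial w_i \partial w_j} f(w) = \frac{\partial^2}{\partial w_j \partial w_i} f(w)$ for most functions $f$, $H(w)$ is symmetric, and we can approximate the gradient by
$$\nabla f(w) \approx \nabla f(w_k) + H(w_k)(w - w_k),$$
according to Fact 4.19. In order to find $w$ that satisfies $\nabla f(w) = 0$, we solve the following secant condition
$$\nabla f(w_k) + H(w_k)(w - w_k) = 0$$
and get $w = w_k - H(w_k)^{-1} \nabla f(w_k)$.

Since we started from an approximation, $\nabla f(w) \neq 0$ in general, and we thus iteratively compute next-step vectors $w_{k+1}$ until the iteration converges to the desired solution $w^*$, as follows:
$$w_{k+1} = w_k - H(w_k)^{-1} \nabla f(w_k). \tag{8.12}$$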


We call this procedure the Newton-Raphson method. Under suitable conditions, we can show that $w_k$ converges to $w^*$, and the convergence rate is fast if we can compute all quantities exactly. It however becomes computationally infeasible to compute the inverse of the Hessian matrix every time as the number of parameters $n$ grows. Here, we thus seek a similar iterative algorithm that does not assume access to the Hessian matrix $H(w_k)$ but only to the gradient.

First, we replace $H(w_k)$ with a similar, symmetric matrix $B_k$ in (8.12), resulting in
$$w_{k+1} = w_k - B_k^{-1} \nabla f(w_k). \tag{8.13}$$
After we determine $w_{k+1}$ with this rule, we find $B_{k+1}$ that closely approximates the second-order derivative $H(w_{k+1})$ using the gradients at $w_k$ and $w_{k+1}$, by solving
$$\nabla f(w_{k+1}) - \nabla f(w_k) = B_{k+1}(w_{k+1} - w_k). \tag{8.14}$$
Because this system has $n$ equations while the number of entries of the $n \times n$ symmetric matrix to be determined is of order $n^2$, there are many solutions $B_{k+1}$. We collectively call optimization algorithms that use any of these solutions quasi-Newton methods.

Among these quasi-Newton methods is the symmetric rank-one method, where we simply add a rank-one matrix to $B_k$ to obtain $B_{k+1}$. The update rule for $B_{k+1}$ in this method is
$$B_{k+1} = B_k + \frac{1}{(d_k - B_k \Delta w_k)^\top \Delta w_k} (d_k - B_k \Delta w_k)(d_k - B_k \Delta w_k)^\top, \tag{8.15}$$
where $\Delta w_k = w_{k+1} - w_k$ and $d_k = \nabla f(w_{k+1}) - \nabla f(w_k)$. Show yourself that $B_{k+1}$ satisfies (8.14).

Because we use (8.13) to update the parameters at each iteration, we need to know $B_{k+1}^{-1}$. We can use the Sherman-Morrison formula (8.11) and get the following update rule to compute $B_{k+1}^{-1}$ directly from $B_k^{-1}$ without $B_{k+1}$:
$$B_{k+1}^{-1} = B_k^{-1} + \frac{1}{(\Delta w_k - B_k^{-1} d_k)^\top d_k} (\Delta w_k - B_k^{-1} d_k)(\Delta w_k - B_k^{-1} d_k)^\top. \tag{8.16}$$
Once we know $B_k^{-1}$, we only need to add a rank-one matrix to efficiently compute $B_{k+1}^{-1}$. In practice, it is usual to maintain and update $B_k^{-1}$ directly rather than $B_k$. It is however more convenient to use (8.15) for analyzing the update rule mathematically.
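One symmetric rank-one step can be traced by hand or in code. The sketch below is a minimal illustration under stated assumptions: the objective is a made-up quadratic $f(w) = \frac{1}{2} w^\top H w - b^\top w$ with gradient $Hw - b$, the initial approximation is the identity, and `sr1_step` is a hypothetical helper that applies the rank-one update to $B_k$ and the corresponding rank-one update to $B_k^{-1}$. It does not guard against a vanishing denominator, which a practical implementation would need to do.

```python
from fractions import Fraction


def matvec(m, x):
    """Matrix-vector product."""
    return [sum(a * b for a, b in zip(row, x)) for row in m]


def dot(x, y):
    """Inner product of two vectors."""
    return sum(a * b for a, b in zip(x, y))


def rank_one_update(m, r, scale):
    """Return m + scale * r r^T."""
    n = len(r)
    return [[m[i][j] + scale * r[i] * r[j] for j in range(n)] for i in range(n)]


def sr1_step(B, B_inv, dw, d):
    """Update B and its inverse from the step dw and the gradient difference d."""
    r = [di - bi for di, bi in zip(d, matvec(B, dw))]        # d_k - B_k dw_k
    s = [wi - bi for wi, bi in zip(dw, matvec(B_inv, d))]    # dw_k - B_k^{-1} d_k
    new_B = rank_one_update(B, r, 1 / Fraction(dot(r, dw)))
    new_B_inv = rank_one_update(B_inv, s, 1 / Fraction(dot(s, d)))
    return new_B, new_B_inv


# Assumed toy problem: quadratic objective with gradient H w - b.
H = [[2, 1], [1, 3]]
b = [1, 1]


def grad(w):
    return [hw - bi for hw, bi in zip(matvec(H, w), b)]


w0 = [0, 0]
B0 = [[1, 0], [0, 1]]
B0_inv = [[1, 0], [0, 1]]

w1 = [wi - gi for wi, gi in zip(w0, matvec(B0_inv, grad(w0)))]   # quasi-Newton step
dw = [a - c for a, c in zip(w1, w0)]
d = [a - c for a, c in zip(grad(w1), grad(w0))]

B1, B1_inv = sr1_step(B0, B0_inv, dw, d)
secant_lhs = matvec(B1, dw)                                       # should equal d
product = [[dot(row, col) for col in zip(*B1_inv)] for row in B1]  # should be the identity
print(B1, B1_inv, secant_lhs, product)
```

The check that `B1` times `B1_inv` is the identity confirms, for this example, that the two rank-one updates stay consistent with each other.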




Chapter 9 


Further Results on Eigenvalues and Eigenvectors


Although we introduced eigenvalues and eigenvectors in Definition 5.1, we waited until this chapter to delve deeper into these two concepts, as doing so requires the notion of determinants.

What is a simple interpretable transformation? As we have seen in Section 3.8.2, scaling transformations would be the simplest one. Then, for the linear transformation corresponding to a matrix $A$, is there any way to interpret the matrix through scaling transformations? It is difficult to get such an interpretation on the whole space at once, but there may be hope if we search for a one-dimensional subspace on which the transformation can be understood as a scaling. This amounts to finding a scaling scalar $\lambda$ and a vector $v$ such that $A = \lambda I$ on the subspace spanned by a single non-zero vector $v$, that is, for vectors $xv$ in the one-dimensional subspace, $A(xv) = (\lambda I)(xv) = x\lambda v$ holds. This relation is simply $Av = \lambda v$, and we say $\lambda$ is an eigenvalue and $v$ an eigenvector, as well as $(\lambda, v)$ an eigenpair. There can be more than one eigenvector per eigenvalue.¹ For each eigenvalue, we can span a subspace by all associated eigenvectors, on which the transformation works as a scaling by the eigenvalue. This story still holds for complex matrices with complex eigenvalues and eigenvectors.

It is difficult to find both eigenvalue and eigenvector at once since the term $\lambda v$ is not linear in the unknowns. So, we first find eigenvalues and then search for eigenvectors. When $A - \lambda I$ is singular, that is, $\det(A - \lambda I) = 0$, $Av = \lambda v$ admits a non-zero solution vector $v$. This allows us to find an eigenvalue by treating $\lambda$ as a variable and solving $\det(A - \lambda I) = 0$. We call this equation a characteristic equation, which is useful for theoretical derivation and abstract reasoning. This is however not helpful in computing eigenvalues of a general matrix in practice. With an eigenvalue in hand, we find eigenvectors as a basis of $\operatorname{Null}(A - \lambda I)$.


We know that the $n$-th order complex polynomial equation admits $n$ roots, including simple and multiple roots. Based on this (though we do not prove it here), every $n \times n$ matrix has $n$ eigenvalues, again including multiple roots. In the appendix, we summarize a minimal set of results on complex numbers needed for studying this chapter.

¹ We refer to the maximum number of linearly independent eigenvectors for an eigenvalue as the geometric multiplicity.

Kang and Cho, Linear Algebra for Data Science, ©2024. (Wanmo Kang, Kyunghyun Cho) all rights reserved.

In the rest of the chapter, we revisit the spectral decomposition of a real symmetric matrix from Section 5.5, as this is the most popular eigendecomposition result in applications. In Chapter 3, we discussed that a linear transformation has a corresponding matrix representation given a basis of a vector space. When we can build a basis consisting only of the eigenvectors of such a matrix, the same linear transformation under this basis corresponds to a diagonal matrix. We refer to the process of finding such a diagonal matrix as diagonalization. Although diagonalization does not work for all matrices, we will study in Chapter 11 that each linear transformation admits a corresponding Jordan-form matrix under the choice of an appropriate basis, where a Jordan form refers to a diagonal matrix or a nearly diagonal matrix.


9.1 Examples of Eigendecomposition 


We can study various cases and properties of eigenpairs using a $2 \times 2$ matrix.

1. A real matrix with two real eigenvalues: When $A = \begin{bmatrix} 3 & 2 \\ -2 & -2 \end{bmatrix}$, $\det(A - \lambda I) = (3 - \lambda)(-2 - \lambda) + 4 = \lambda^2 - \lambda - 2 = (\lambda - 2)(\lambda + 1)$. The eigenvalues are thus $\lambda = 2$ and $-1$, and their corresponding eigenvectors are $(2, -1)^\top$ and $(1, -2)^\top$, respectively, because $A - \lambda I = \begin{bmatrix} 1 & 2 \\ -2 & -4 \end{bmatrix}$ and $\begin{bmatrix} 4 & 2 \\ -2 & -1 \end{bmatrix}$.

2. A real diagonal matrix: When $A = \begin{bmatrix} a & 0 \\ 0 & b \end{bmatrix}$, the eigenvalues are $\lambda = a$ and $b$, because $\det(A - \lambda I) = (a - \lambda)(b - \lambda)$.

   • $a \neq b$: The eigenvectors are $(1, 0)^\top$ and $(0, 1)^\top$, because $A - \lambda I = \begin{bmatrix} 0 & 0 \\ 0 & b - a \end{bmatrix}$ and $\begin{bmatrix} a - b & 0 \\ 0 & 0 \end{bmatrix}$.

   • $a = b$: Any arbitrary pair of linearly independent vectors can be eigenvectors, since $A - \lambda I = \begin{bmatrix} 0 & 0 \\ 0 & 0 \end{bmatrix}$.

3. A real upper-triangular matrix: When $A = \begin{bmatrix} a & c \\ 0 & b \end{bmatrix}$, $c \neq 0$, the eigenvalues are $\lambda = a$ and $b$, because $\det(A - \lambda I) = (a - \lambda)(b - \lambda)$.

   • $a \neq b$: The eigenvectors are $(1, 0)^\top$ and $(1, \frac{b - a}{c})^\top$, because $A - \lambda I = \begin{bmatrix} 0 & c \\ 0 & b - a \end{bmatrix}$ and $\begin{bmatrix} a - b & c \\ 0 & 0 \end{bmatrix}$.

   • $a = b$: The eigenvector can be any non-zero scalar multiple of $(1, 0)^\top$, because $A - \lambda I = \begin{bmatrix} 0 & c \\ 0 & 0 \end{bmatrix}$. That is, this matrix has only one linearly independent eigenvector. This is an example of a real asymmetric matrix that does not possess as many eigenvectors as the number of roots (algebraic multiplicity).

4. A real matrix with two complex eigenvalues: When $A = \begin{bmatrix} 0 & -1 \\ 1 & 0 \end{bmatrix}$, the eigenvalues are $\lambda = -i$ and $i$, because $\det(A - \lambda I) = \lambda^2 + 1$. The corresponding eigenvectors are $(-i, 1)^\top$ and $(i, 1)^\top$, respectively, since $A - \lambda I = \begin{bmatrix} i & -1 \\ 1 & i \end{bmatrix}$ and $\begin{bmatrix} -i & -1 \\ 1 & -i \end{bmatrix}$.

5. Eigendecomposition of a non-trivial projection matrix $P$ satisfying $P^2 = P$ (neither $I$ nor $0$): Consider $(\lambda, v)$ from $Pv = \lambda v$. By multiplying both sides with $P$, we get $P^2 v = \lambda P v$, which simplifies to $Pv = \lambda^2 v$, since $P^2 = P$ and $Pv = \lambda v$. Thus, $\lambda = 0$ or $1$, as $(\lambda^2 - \lambda)v = 0$. Since a non-trivial projection matrix satisfies $0 < \operatorname{rank}(P) < n$, any non-zero vector in the null space of $P$ is an eigenvector corresponding to the eigenvalue 0. Since a vector $v$ already projected onto the subspace satisfies $Pv = v$, such a non-zero vector is an eigenvector corresponding to the eigenvalue 1.

From the examples above, we can deduce that eigenpairs can come in many different ways.
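Eigenpairs of a $2 \times 2$ matrix are easy to verify by direct substitution into $Av = \lambda v$. The sketch below is an illustration only: it takes the matrices of examples 1 and 4 to be $\begin{bmatrix} 3 & 2 \\ -2 & -2 \end{bmatrix}$ and $\begin{bmatrix} 0 & -1 \\ 1 & 0 \end{bmatrix}$ (an assumption read off from the stated characteristic polynomials and eigenvectors), and the helper names `is_eigenpair` and `eigenvalues_2x2` are hypothetical.

```python
import cmath


def is_eigenpair(a, lam, v, tol=1e-12):
    """Check A v = lam v entrywise for a non-zero vector v (complex entries allowed)."""
    if all(x == 0 for x in v):
        return False
    av = [sum(x * y for x, y in zip(row, v)) for row in a]
    return all(abs(p - lam * q) < tol for p, q in zip(av, v))


def eigenvalues_2x2(a):
    """Roots of the characteristic equation lam^2 - trace * lam + det = 0."""
    tr = a[0][0] + a[1][1]
    dt = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    root = cmath.sqrt(tr * tr - 4 * dt)
    return (tr + root) / 2, (tr - root) / 2


A1 = [[3, 2], [-2, -2]]   # assumed matrix of example 1 (two real eigenvalues)
A4 = [[0, -1], [1, 0]]    # assumed matrix of example 4 (two complex eigenvalues)

print(eigenvalues_2x2(A1), eigenvalues_2x2(A4))
print(is_eigenpair(A1, 2, [2, -1]), is_eigenpair(A1, -1, [1, -2]))
print(is_eigenpair(A4, -1j, [-1j, 1]), is_eigenpair(A4, 1j, [1j, 1]))
```

Solving the quadratic characteristic equation is feasible here only because $n = 2$; for general matrices, practical eigenvalue algorithms do not work through the characteristic polynomial.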


9.2 Properties of an Eigenpair 
We start with the linear independence of eigenvectors associated with distinct eigenvalues. 


Fact 9.1 If eigenvectors $v_1, \ldots, v_k$ correspond to different eigenvalues $\lambda_1, \ldots, \lambda_k$, then the eigenvectors are linearly independent.

Proof: Suppose that $\{v_1, \ldots, v_k\}$ is linearly dependent. Then, there exist $c_1, \ldots, c_k \in \mathbb{R}$, with at least one of them being non-zero, that satisfy $c_1 v_1 + \cdots + c_k v_k = 0$. As we can permute the order within $\{v_1, \ldots, v_k\}$, we assume $c_1 \neq 0$ without loss of generality. By multiplying both sides of the equation by $A$, we get $A(c_1 v_1 + \cdots + c_k v_k) = c_1 \lambda_1 v_1 + \cdots + c_k \lambda_k v_k = 0$. We then multiply the original equation by $\lambda_k$ and subtract it from this new equation to get
$$(c_1 \lambda_1 v_1 + \cdots + c_{k-1} \lambda_{k-1} v_{k-1} + c_k \lambda_k v_k) - (c_1 \lambda_k v_1 + \cdots + c_{k-1} \lambda_k v_{k-1} + c_k \lambda_k v_k) = c_1 (\lambda_1 - \lambda_k) v_1 + \cdots + c_{k-1} (\lambda_{k-1} - \lambda_k) v_{k-1} = 0.$$
Since $c_1 (\lambda_1 - \lambda_k) \neq 0$, $\{v_1, \ldots, v_{k-1}\}$ is also linearly dependent. If we repeat this argument $k - 2$ more times, we end up with $c_1' v_1 = 0$ where $c_1' = c_1 (\lambda_1 - \lambda_k) \cdots (\lambda_1 - \lambda_3)(\lambda_1 - \lambda_2)$. Since $v_1 \neq 0$, $c_1' = 0$ and hence $c_1 = 0$, which is contradictory. Therefore, $\{v_1, \ldots, v_k\}$ is linearly independent. ∎



It is also interesting that the determinant and the trace of a matrix can be expressed only in terms of its eigenvalues.

Fact 9.2 Let $A = (a_{ij})$ be an $n \times n$ matrix and $\lambda_1, \ldots, \lambda_n$ be its eigenvalues. Then,
• $\det(A) = \lambda_1 \lambda_2 \cdots \lambda_n = \prod_{i=1}^{n} \lambda_i$
• $\operatorname{trace}(A) = a_{11} + a_{22} + \cdots + a_{nn} = \lambda_1 + \lambda_2 + \cdots + \lambda_n = \sum_{i=1}^{n} \lambda_i$

Proof: According to (8.2), $\det(A - \lambda I)$ is an $n$-th order polynomial of $\lambda$, and the coefficient of its $n$-th order term is $(-1)^n$. Since the roots of this polynomial are the eigenvalues,
$$\det(A - \lambda I) = (-1)^n (\lambda - \lambda_1) \cdots (\lambda - \lambda_n) = (\lambda_1 - \lambda) \cdots (\lambda_n - \lambda). \tag{9.1}$$

• If we set $\lambda = 0$ in (9.1), $\det(A) = \lambda_1 \cdots \lambda_n$.

• The entries of $A - \lambda I$ that contain $\lambda$ are only the $n$ diagonal entries, $a_{ii} - \lambda$. Let us do the cofactor expansion of $B = A - \lambda I$ along the first row. Consider the summand $b_{1i} \det(B_{1i})$ for $i \ge 2$, where $B_{1i}$ contains $n - 2$ diagonal entries² including $\lambda$, namely all except $a_{11} - \lambda$ and $a_{ii} - \lambda$. Therefore, the order of $\lambda$ in $\det(B_{1i})$ is at most $n - 2$, since $\det(B_{1i})$ sums up products of entries chosen from different rows. Therefore, $\lambda^{n-1}$ is contained only in $b_{11} \det(B_{11}) = (a_{11} - \lambda) \det(B_{11})$. If we repeat the same argument to identify $\lambda^{n-2}$ in $\det(B_{11})$ and so forth, we conclude that $\lambda^{n-1}$ appears only in $(a_{11} - \lambda) \cdots (a_{nn} - \lambda)$ among the terms of $\det(A - \lambda I)$. Here, the coefficient of $\lambda^{n-1}$ is $(-1)^{n-1}(a_{11} + \cdots + a_{nn})$, and hence $\operatorname{trace}(A) = \lambda_1 + \lambda_2 + \cdots + \lambda_n$, since the coefficient of $\lambda^{n-1}$ in $\det(A - \lambda I)$ is also $(-1)^{n-1}(\lambda_1 + \cdots + \lambda_n)$ by expanding (9.1).

∎
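Both identities can be checked on a matrix whose eigenvalues are known in advance. The sketch below assumes the illustrative matrix with 2 on the diagonal and 1 elsewhere, whose eigenvalues are 4, 1, 1 (counted with multiplicity); it first confirms the factored form of $\det(A - \lambda I)$ at more points than the degree of the polynomial, which pins the polynomial down, and then compares the product and the sum of the eigenvalues with the determinant and the trace. The helper names are hypothetical.

```python
def minor(a, i, j):
    """Submatrix of a with row i and column j removed."""
    return [row[:j] + row[j + 1:] for k, row in enumerate(a) if k != i]


def det(a):
    """Determinant by cofactor expansion along the first row; the empty matrix has determinant 1."""
    if not a:
        return 1
    return sum((-1) ** j * a[0][j] * det(minor(a, 0, j)) for j in range(len(a)))


def char_poly(a, lam):
    """Value of det(A - lam I) at a given number lam."""
    n = len(a)
    return det([[a[i][j] - (lam if i == j else 0) for j in range(n)] for i in range(n)])


A = [[2, 1, 1], [1, 2, 1], [1, 1, 2]]
eigenvalues = [4, 1, 1]   # assumed eigenvalues, repeated according to multiplicity


def factored(lam):
    """Product of (eigenvalue - lam) over all eigenvalues."""
    value = 1
    for ev in eigenvalues:
        value *= ev - lam
    return value


# Two cubic polynomials that agree at five points are identical.
same_polynomial = all(char_poly(A, t) == factored(t) for t in range(-2, 3))

product_of_eigenvalues = 1
for ev in eigenvalues:
    product_of_eigenvalues *= ev
sum_of_eigenvalues = sum(eigenvalues)
trace = sum(A[i][i] for i in range(len(A)))
print(same_polynomial, product_of_eigenvalues == det(A), sum_of_eigenvalues == trace)
```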

Because we define the eigenvalue through determinants, some properties of the determinant have counterpart observations for the eigenvalue. For example, since the determinants of $A$ and $A^\top$ are the same, $\det(A - \lambda I) = \det(A^\top - \lambda I)$, and therefore, the eigenvalues of $A$ and $A^\top$ are also shared.

Fact 9.3 The eigenvalues of $A$ and $A^\top$ coincide.

By algebraically manipulating $Av = \lambda v$ in various ways, we can derive properties of eigenpairs. Let $(\lambda, v)$ be an eigenpair of $A$. Then, $(\lambda^2, v)$ is an eigenpair of $A^2$, because $A^2 v = A(Av) = A(\lambda v) = \lambda A v = \lambda^2 v$.

Fact 9.4 If $(\lambda, v)$ is an eigenpair of $A$, then $(\lambda^2, v)$ is an eigenpair of $A^2$.

When $A$ is invertible and has an eigenpair $(\lambda, v)$, $\lambda \neq 0$ from the invertibility of $A$, and $A^{-1} v = \lambda^{-1} v$ follows by multiplying both sides of $Av = \lambda v$ by $\lambda^{-1} A^{-1}$. Therefore, $(\lambda^{-1}, v)$ is an eigenpair of $A^{-1}$.


² Recall that $B_{ij}$ is the submatrix built by removing the $i$-th row and the $j$-th column.




Fact 9.5 If $A$ is invertible and $(\lambda, v)$ is an eigenpair of $A$, then $\lambda \neq 0$ and $(\lambda^{-1}, v)$ is an eigenpair of $A^{-1}$.


Interestingly, if a complex matrix is Hermitian,³ which includes the case of a real symmetric matrix, all eigenvalues are real.

Lemma 9.1 Let $A$ be a Hermitian matrix. Then, all eigenvalues of $A$ are real. Furthermore, if $A$ is a real symmetric matrix, not only real eigenvalues but also real eigenvectors exist.

Proof: Let $A$ be Hermitian and $(\lambda, v)$ an eigenpair of $A$. Recall that $|v| \neq 0$. From $Av = \lambda v$, $v^H A v = \lambda v^H v = \lambda |v|^2$. So,
$$\bar{\lambda}\, |v|^2 = (\lambda |v|^2)^H = (v^H A v)^H = v^H A^H (v^H)^H = v^H A v = \lambda |v|^2,$$
which implies that $\lambda$ is real by Fact E.1 in Appendix E. In addition, let us assume that $A$ is real and $(\lambda, v + iw)$ is an eigenpair where $\lambda$ is real, and $v$ and $w$ are real $n$-vectors. Then, the eigenpair satisfies
$$Av + iAw = A(v + iw) = \lambda(v + iw) = \lambda v + i\lambda w,$$
which implies that $(\lambda, v)$ and $(\lambda, w)$ are also eigenpairs whenever $v$ and $w$ are non-zero, and at least one of them is. ∎
Similarly to the linear independence of eigenvectors for distinct eigenvalues in Fact 9.1, if a matrix is Hermitian, eigenvectors of distinct eigenvalues are not only linearly independent but also orthogonal. Recall Fact 4.4, which says that the orthogonality of vectors implies their linear independence.

Fact 9.6 Let $A$ be Hermitian. If eigenvectors $v_1$ and $v_2$ correspond to different eigenvalues $\lambda_1$ and $\lambda_2$, those eigenvectors are orthogonal.

Proof: By Lemma 9.1, the eigenvalues are real. From $Av_1 = \lambda_1 v_1$ and $Av_2 = \lambda_2 v_2$, $v_2^H A v_1 = \lambda_1 v_2^H v_1$ and $v_1^H A v_2 = \lambda_2 v_1^H v_2$. Since $\lambda_1 \neq \lambda_2$ and
$$0 = (v_2^H A v_1)^H - v_1^H A v_2 = (\lambda_1 v_2^H v_1)^H - \lambda_2 v_1^H v_2 = (\lambda_1 - \lambda_2)\, v_1^H v_2,$$
we conclude $v_1^H v_2 = 0$. ∎
If a matrix is orthogonal, we can observe that the absolute values of the eigenvalues are all 1, and the 


eigenvectors associated with distinct eigenvalues are orthogonal as for Hermitian matrices. 


Fact 9.7 Let $Q$ be a real orthogonal matrix and $(\lambda, v)$ an eigenpair of $Q$, where $\lambda$ and $v$ may be complex-valued. Then

1. $|\lambda| = 1$.

2. $(\lambda^{-1}, v)$ is an eigenpair of $Q^\top$.


³ We define the Hermitian matrix and also discuss various results on complex numbers in Appendix E for your review.


172 Chapter 9. Further Results on Eigenvalues and Eigenvectors 


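3. If eigenvectors $v_1$ and $v_2$ correspond to different eigenvalues $\lambda_1$ and $\lambda_2$, then those eigenvectors are orthogonal.

Proof: Recall that $Q^\top Q = I$ and $Qv = \lambda v$.
1. $|v|^2 = v^H v = v^H (Q^\top Q) v = (Qv)^H (Qv) = (\lambda v)^H (\lambda v) = |\lambda|^2 |v|^2$, and hence $|\lambda|^2 = 1$.
2. $v = (Q^\top Q) v = Q^\top (\lambda v) = \lambda Q^\top v$. So, $Q^\top v = \lambda^{-1} v$.
3. Let $Qv_1 = \lambda_1 v_1$ and $Qv_2 = \lambda_2 v_2$. Then,
\[ v_1^H v_2 = v_1^H (Q^\top Q) v_2 = (Qv_1)^H (Qv_2) = \bar{\lambda}_1 \lambda_2 v_1^H v_2, \]
and we must have $\bar{\lambda}_1 \lambda_2 = 1$ or $v_1^H v_2 = 0$. However, $1 = |\lambda_1|^2 = \bar{\lambda}_1 \lambda_1$ implies $\bar{\lambda}_1 \lambda_2 = \lambda_2 / \lambda_1 \neq 1$ since $\lambda_1 \neq \lambda_2$. Therefore, $v_1^H v_2 = 0$.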
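The three claims of Fact 9.7 can be observed on a plane rotation, which is a real orthogonal matrix with complex eigenpairs. This is only a sketch assuming NumPy; the angle 0.7 is an arbitrary choice made here.

```python
import numpy as np

theta = 0.7  # arbitrary rotation angle (an assumption of this example)
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

eigvals, eigvecs = np.linalg.eig(Q)  # eigenvalues are exp(+i*theta) and exp(-i*theta)

# Claim 1: every eigenvalue has absolute value 1.
moduli = np.abs(eigvals)

# Claim 2: (1/lambda, v) is an eigenpair of Q^T.
lam, v = eigvals[0], eigvecs[:, 0]
residual = np.linalg.norm(Q.T @ v - v / lam)

# Claim 3: eigenvectors of distinct eigenvalues are orthogonal.
# np.vdot conjugates its first argument, so this is v_1^H v_2.
inner = np.vdot(eigvecs[:, 0], eigvecs[:, 1])

print(moduli, residual, abs(inner))
```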


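Fact 9.8 If $A$ is a real matrix and $(\lambda, v)$ is an eigenpair of $A$, then $(\bar{\lambda}, \bar{v})$ is also an eigenpair of $A$, where $\bar{\lambda}$ and $\bar{v}$ are the complex conjugates of $\lambda$ and $v$, respectively.

Proof: By taking the conjugation on both sides of $Av = \lambda v$, we get $\overline{Av} = \overline{\lambda v}$. By complex arithmetic, we see that $\bar{A}\bar{v} = \bar{\lambda}\bar{v}$. Since $A$ is a real matrix, $\bar{A} = A$, and thus $A\bar{v} = \bar{\lambda}\bar{v}$. Therefore, $(\bar{\lambda}, \bar{v})$ is also an eigenpair. □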

Using the results from Section 7.5, we can express the volume of an arbitrary n-dimensional ellipsoid 

as the multiple of the volume of a unit sphere. We further characterize this in terms of the determinant 


of the positive definite matrix inducing the ellipsoid. 


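Fact 9.9 For a positive definite matrix $A$, an $n$-dimensional ellipsoid is given as $E = \{x \in \mathbb{R}^n : x^\top A^{-1} x \le 1\}$. Then,
\[ \mathrm{vol}(E) = \sqrt{\det(A)}\, \mathrm{vol}(B_n). \quad (9.2) \]

Proof: By Fact 9.2, $\prod_{i=1}^n \sqrt{\lambda_i(A)} = \sqrt{\det(A)}$, and therefore, $\mathrm{vol}(E) = \sqrt{\det(A)} \times \mathrm{vol}(B_n)$.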


9.3 Similarity and the Change of Basis 


We say that two matrices are similar when they represent the same linear transformation under appro- 


priate choices of bases. 


9.3. Similarity and the Change of Basis 173 


9.3.1 The Change of Basis 


Given a basis $\mathcal{B}_1$ of a vector space $V$, we want to consider another basis $\mathcal{B}_2$. In order to translate the properties expressed by the basic vectors in $\mathcal{B}_1$ into the descriptions in terms of basic vectors in $\mathcal{B}_2$, we must first establish the relationship between $\mathcal{B}_1$ and $\mathcal{B}_2$, using the ideas developed in Section 3.8.1.

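Let $\mathcal{B}_1 = \{v_1, \ldots, v_n\}$ and $\mathcal{B}_2 = \{w_1, \ldots, w_n\}$. Since $\mathcal{B}_2$ is a basis itself, we can express each $v_j$ as a linear combination of the basic vectors of $\mathcal{B}_2$, as follows
\[ v_j = \sum_{i=1}^n b_{ij} w_i. \quad (9.3) \]
With this, let us try to compute the coordinate vector $y \in \mathbb{R}^n$ of the vector $v$ with respect to $\mathcal{B}_2$, when its coordinate vector under $\mathcal{B}_1$ is $x = (x_1, \ldots, x_n)^\top \in \mathbb{R}^n$, that is,
\[ v = \sum_{j=1}^n x_j v_j. \]
Plugging in (9.3), we get
\[ v = \sum_{j=1}^n x_j \sum_{i=1}^n b_{ij} w_i = \sum_{i=1}^n \Big( \sum_{j=1}^n b_{ij} x_j \Big) w_i. \quad (9.4) \]
With a square matrix $B = (b_{ij})$, we can write the relationship between $x$ and $y$ as
\[ y_i = \sum_{j=1}^n b_{ij} x_j \quad\text{or}\quad y = Bx. \quad (9.5) \]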


Since $b_{ij}$ relates the $i$-th and $j$-th basic vectors from two bases, (9.5) holds for all vectors with the same $B = (b_{ij})$ as long as we fix the two bases.
We can swap the roles of $\mathcal{B}_1$ and $\mathcal{B}_2$ and repeat the argument above to obtain another square matrix $\tilde{B}$ that maps $y$ to $x$, that is, $x = \tilde{B}y$. Together with (9.5),
\[ x = \tilde{B}Bx \ \text{ for all } x \in \mathbb{R}^n \quad\text{and}\quad y = B\tilde{B}y \ \text{ for all } y \in \mathbb{R}^n, \]
which implies that $\tilde{B}B = B\tilde{B} = I_n$. In summary, the matrix that represents the change of basis is invertible, and two matrices that represent the mapping between two bases are inverses of each other.

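Conversely, we can build a new basis $\mathcal{B}' = \{v_1, \ldots, v_n\}$ satisfying (9.3) once a basis $\mathcal{B} = \{w_1, \ldots, w_n\}$ and an arbitrary invertible matrix $B = (b_{ij})$ are given. It is enough to check whether $\mathcal{B}'$ is linearly independent to confirm that it is really a basis. Assume $\sum_{j=1}^n \lambda_j v_j = 0$ with $\lambda = (\lambda_1, \ldots, \lambda_n)^\top \in \mathbb{R}^n$. Then,
\[ 0 = \sum_{j=1}^n \lambda_j v_j = \sum_{j=1}^n \lambda_j \sum_{i=1}^n b_{ij} w_i = \sum_{i=1}^n \Big( \sum_{j=1}^n b_{ij} \lambda_j \Big) w_i \]
holds, and hence $\sum_{j=1}^n b_{ij} \lambda_j = 0$ for all $i$, that is, $B\lambda = 0$ since $w_1, \ldots, w_n$ are linearly independent. Therefore, $\lambda = B^{-1}0 = 0$ and $\mathcal{B}'$ is linearly independent and a basis.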
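The relation $y = Bx$ in (9.5) can be made concrete in $\mathbb{R}^2$. In the sketch below (assuming NumPy), the two bases are stored as the columns of `V` and `W`; these names and the numerical values are illustrative choices made here. In this column form, (9.3) reads `V = W @ B`.

```python
import numpy as np

# Columns of V are the basic vectors v_1, v_2 of the first basis;
# columns of W are the basic vectors w_1, w_2 of the second basis.
V = np.array([[1.0, 1.0],
              [0.0, 1.0]])
W = np.array([[2.0, 0.0],
              [1.0, 1.0]])

# (9.3) in matrix form is V = W B, so B solves the linear system W B = V.
B = np.linalg.solve(W, V)

# A vector with coordinates x under the first basis ...
x = np.array([3.0, -2.0])
vec = V @ x

# ... has coordinates y = B x under the second basis, as in (9.5).
y = B @ x

# Swapping the roles of the bases gives the inverse change-of-basis matrix.
B_tilde = np.linalg.solve(V, W)

print(y, W @ y, vec)
```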




9.3.2 The Change of Orthogonal Basis 


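Consider the case where both $\mathcal{B}_1$ and $\mathcal{B}_2$ are orthonormal bases. Since $w_i^\top w_\ell$ is 1 when $i = \ell$ and otherwise 0, we get
\[ v_j^\top v_k = \Big( \sum_{i=1}^n b_{ij} w_i \Big) \cdot \Big( \sum_{\ell=1}^n b_{\ell k} w_\ell \Big) = \Big( \sum_{i=1}^n b_{ij} w_i^\top \Big) \Big( \sum_{\ell=1}^n b_{\ell k} w_\ell \Big) = \sum_{i=1}^n \sum_{\ell=1}^n b_{ij} b_{\ell k} w_i^\top w_\ell = \sum_{i=1}^n b_{ij} b_{ik}. \]
$v_j^\top v_k$ is also 1 only when $j = k$ and otherwise 0. Because $\sum_{i=1}^n b_{ij} b_{ik}$ is the $(j, k)$-th entry of $B^\top B$,
\[ B^\top B = I_n. \]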


In other words, B is an orthogonal matrix, and the change-of-basis matrix between two orthonormal 


bases is orthogonal. 


9.3.3 Similarity 


Consider a linear transformation $T : V \to V$ defined in a vector space $V$ and its two bases $\mathcal{B}_1$ and $\mathcal{B}_2$. Let $A$ be the transformation matrix of $T$ with respect to $\mathcal{B}_1$ such that the $\mathcal{B}_1$-coordinate of $T(v)$ is $Ax$ where $v$'s $\mathcal{B}_1$-coordinate is $x$. In addition, $B$ is the change-of-basis matrix from $\mathcal{B}_1$ to $\mathcal{B}_2$ as in (9.5). We now derive the matrix representation of $T$ with respect to $\mathcal{B}_2$.

According to (9.5), the $\mathcal{B}_1$-coordinate of a vector whose $\mathcal{B}_2$-coordinate is $y$ is $B^{-1}y$. The $\mathcal{B}_1$-coordinate of this vector after transformation by $T$ is then $AB^{-1}y$. Finally, the $\mathcal{B}_2$-coordinate of a vector whose $\mathcal{B}_1$-coordinate is $AB^{-1}y$ is $BAB^{-1}y$.
\[
\begin{array}{cccc}
\mathcal{B}_1: & B^{-1}y & \xrightarrow{\ A\ } & AB^{-1}y \\
 & \uparrow B^{-1} & & \downarrow B \\
\mathcal{B}_2: & y & \xrightarrow{\ A'\ } & A'y = BAB^{-1}y
\end{array}
\]
If we let $A'$ be the transformation matrix of $T$ under $\mathcal{B}_2$, we arrive at $A'y = BAB^{-1}y$ for all $y \in \mathbb{R}^n$, which implies
\[ A' = BAB^{-1} \quad\text{and}\quad A = B^{-1}A'B. \]
As $A$ and $BAB^{-1}$ represent the same transformation $T$ under two different bases, respectively, we can view them as equivalent in terms of how a vector transforms in $V$. We thus say they are similar.


9.4. Diagonalization 175 


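Definition 9.1 Square matrices $A$ and $A'$ are similar if there is an invertible matrix $B$ such that
\[ A = B^{-1}A'B. \]

Fact 9.10 If $A$ and $A'$ are similar, they share the same eigenvalues.
Proof: Suppose that $A = B^{-1}A'B$ and $(\lambda, v)$ is an eigenpair of $A$ satisfying $Av = \lambda v$. Then,
\[ A'(Bv) = (A'B)v = (BA)v = B(Av) = B(\lambda v) = \lambda(Bv) \]
holds. Since $B$ is invertible and $v \neq 0$, $Bv \neq 0$, and therefore $(\lambda, Bv)$ is an eigenpair of $A'$. Applying the same argument to $A' = BAB^{-1}$ shows that every eigenvalue of $A'$ is also an eigenvalue of $A$. □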
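Fact 9.10 and the eigenvector mapping $v \mapsto Bv$ used in its proof can be checked numerically. This is a sketch assuming NumPy; the matrices `A_prime` and `B` are arbitrary examples chosen here.

```python
import numpy as np

A_prime = np.array([[2.0, 1.0],
                    [1.0, 3.0]])
B = np.array([[1.0, 2.0],
              [0.0, 1.0]])           # any invertible matrix would do

# A is similar to A_prime in the sense of Definition 9.1: A = B^{-1} A' B.
A = np.linalg.inv(B) @ A_prime @ B

eig_A = np.sort(np.linalg.eigvals(A))
eig_A_prime = np.sort(np.linalg.eigvals(A_prime))

# If (lam, v) is an eigenpair of A, then (lam, B v) is an eigenpair of A'.
lams, vecs = np.linalg.eig(A)
lam, v = lams[0], vecs[:, 0]
residual = np.linalg.norm(A_prime @ (B @ v) - lam * (B @ v))

print(eig_A, eig_A_prime, residual)
```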


9.4 Diagonalization 


A diagonal matrix is considered the simplest form of a matrix. If we can find a basis with respect 
to which a transformation matrix is diagonal by the change of basis, this can help us understand and 


analyze the corresponding linear transformation. When this is possible, we call such a transformation 


matrix diagonalizable. 


We first investigate the most basic necessary and sufficient condition for a diagonalizable matrix.
Fact 9.11 An $n \times n$ matrix $A$ is diagonalizable if and only if $A$ has $n$ linearly independent eigenvectors.

Proof: Define a diagonal matrix $\Lambda$ as
\[ \Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_n) = \begin{bmatrix} \lambda_1 & 0 & \cdots & 0 \\ 0 & \lambda_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \lambda_n \end{bmatrix}. \]
• if: Let $(\lambda_1, v_1), \ldots, (\lambda_n, v_n)$ be $n$ eigenpairs with linearly independent $v_i$'s, although some eigenvalues may coincide. With invertible $B = [v_1 \mid \cdots \mid v_n]$ and $Av_i = \lambda_i v_i$, we know $AB = B\Lambda$ and that $B^{-1}AB = \Lambda$. Thus, $A$ is diagonalizable.

• only if: Because $A$ is diagonalizable, there exist an invertible $B$ and a diagonal matrix $\Lambda$ such that $A = B^{-1}\Lambda B$. Because then $AB^{-1} = B^{-1}\Lambda$, each column of $B^{-1}$ is an eigenvector. Since $B^{-1}$ is invertible, its columns are linearly independent.

□
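The "if" direction of Fact 9.11 translates directly into a computation: stack the eigenvectors as columns of $B$ and form $B^{-1}AB$. A minimal sketch, assuming NumPy and an example matrix chosen here (its eigenvalues are 2 and 5):

```python
import numpy as np

A = np.array([[4.0, 1.0],
              [2.0, 3.0]])

# Columns of B are eigenvectors of A; lams holds the matching eigenvalues.
lams, B = np.linalg.eig(A)

# With linearly independent eigenvectors, B is invertible and B^{-1} A B is diagonal.
Lam = np.linalg.inv(B) @ A @ B

print(np.round(Lam, 10))
```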
We can ask when the eigenvectors of two matrices coincide perfectly. There is no known result for 
two n x n matrices in general. We however know that the necessary and sufficient condition for two 


diagonalizable matrices, A and B, to share their eigenvectors is for them to commute, that is, AB = BA. 




Theorem 9.1 Let $A$ and $B$ be $n \times n$ diagonalizable matrices. Then, $A$ and $B$ share the same eigenvectors if and only if $AB = BA$.


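Proof: Let $A$ and $B$ share the same eigenvectors and denote the square matrix of those shared eigenvectors as $S$. Then $A = S^{-1}\Lambda_A S$ and $B = S^{-1}\Lambda_B S$. Since diagonal matrices commute,
\[ AB = S^{-1}\Lambda_A S S^{-1}\Lambda_B S = S^{-1}\Lambda_A \Lambda_B S = S^{-1}\Lambda_B \Lambda_A S = S^{-1}\Lambda_B S S^{-1}\Lambda_A S = BA. \]

Conversely, assume that diagonalizable matrices $A$ and $B$ commute. Let $S$ be an invertible matrix of eigenvectors of $A$ such that $A = S^{-1}\Lambda S$. Set $B = S^{-1}\hat{B}S$. The commutativity of $A$ and $B$ implies the commutativity of $\Lambda$ and $\hat{B}$. Assume $\Lambda = \mathrm{diag}(\lambda_1 I_{n_1}, \ldots, \lambda_k I_{n_k})$ where each $I_{n_i}$ is an identity matrix of size $n_i$ such that $n = n_1 + \cdots + n_k$. Of course, $\lambda_i$'s are assumed to be different from each other. Denote the matrix $\hat{B}$ by a partitioned matrix $(\hat{B}_{ij})$ where $\hat{B}_{ij}$ is an $n_i \times n_j$ matrix. Then, since
\[
\Lambda\hat{B} = \begin{bmatrix} \lambda_1\hat{B}_{11} & \lambda_1\hat{B}_{12} & \cdots & \lambda_1\hat{B}_{1k} \\ \lambda_2\hat{B}_{21} & \lambda_2\hat{B}_{22} & \cdots & \lambda_2\hat{B}_{2k} \\ \vdots & \vdots & \ddots & \vdots \\ \lambda_k\hat{B}_{k1} & \lambda_k\hat{B}_{k2} & \cdots & \lambda_k\hat{B}_{kk} \end{bmatrix}
\quad\text{and}\quad
\hat{B}\Lambda = \begin{bmatrix} \lambda_1\hat{B}_{11} & \lambda_2\hat{B}_{12} & \cdots & \lambda_k\hat{B}_{1k} \\ \lambda_1\hat{B}_{21} & \lambda_2\hat{B}_{22} & \cdots & \lambda_k\hat{B}_{2k} \\ \vdots & \vdots & \ddots & \vdots \\ \lambda_1\hat{B}_{k1} & \lambda_2\hat{B}_{k2} & \cdots & \lambda_k\hat{B}_{kk} \end{bmatrix},
\]
$\Lambda\hat{B} = \hat{B}\Lambda$ holds if and only if $\hat{B}_{ij} = 0$ for $i \neq j$. That is, $\hat{B} = \mathrm{diag}(\hat{B}_{11}, \ldots, \hat{B}_{kk})$. From the diagonalizability of $B$, $\hat{B}$ is also diagonalizable, and each $\hat{B}_{ii}$ is diagonalizable in turn. Let some $S_i$ satisfy $\hat{B}_{ii} = S_i^{-1}\Lambda_i S_i$ where $\Lambda_i$ is a diagonal matrix. Denote $\hat{S} = \mathrm{diag}(S_1, \ldots, S_k)$ and $\hat{\Lambda} = \mathrm{diag}(\Lambda_1, \ldots, \Lambda_k)$ such that $\hat{B} = \hat{S}^{-1}\hat{\Lambda}\hat{S}$. By recalling the common block structure of $\hat{S}$ and $\Lambda$, we also observe $\Lambda = \hat{S}^{-1}\Lambda\hat{S}$. If we combine these, we obtain
\[ A = S^{-1}\Lambda S = S^{-1}\hat{S}^{-1}\Lambda\hat{S}S = (\hat{S}S)^{-1}\Lambda(\hat{S}S) \quad\text{and}\quad B = S^{-1}\hat{B}S = S^{-1}\hat{S}^{-1}\hat{\Lambda}\hat{S}S = (\hat{S}S)^{-1}\hat{\Lambda}(\hat{S}S). \]
Therefore, $A$ and $B$ share the common matrix $(\hat{S}S)^{-1}$ of eigenvectors. □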
According to the real spectral theorem (Theorem 5.2), a symmetric matrix is diagonalizable. Furthermore, since symmetric matrices satisfy $(AB)^\top = B^\top A^\top = BA$, $AB = BA$ is equivalent to the symmetry of $AB$ when $A$ and $B$ are both symmetric. Combining these two, we get the following result.

Corollary 9.1 Let $A$ and $B$ be $n \times n$ symmetric matrices. Then, $A$ and $B$ share the same eigenvector matrix if and only if $AB$ is symmetric.
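Theorem 9.1 and Corollary 9.1 can be illustrated by building two symmetric matrices from one orthogonal eigenvector matrix. The sketch below assumes NumPy; the orthogonal matrix `Q` (taken from a QR factorization of an arbitrary matrix `M`) and the eigenvalues are choices made here for illustration only.

```python
import numpy as np

# Build an orthogonal matrix Q from the QR factorization of an arbitrary matrix.
M = np.array([[2.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 4.0]])
Q, _ = np.linalg.qr(M)

# Two symmetric matrices sharing the eigenvector matrix Q.
A = Q @ np.diag([1.0, 2.0, 3.0]) @ Q.T
B = Q @ np.diag([5.0, -1.0, 4.0]) @ Q.T

commute = np.allclose(A @ B, B @ A)
product_symmetric = np.allclose(A @ B, (A @ B).T)

# Eigenvectors computed from A alone (its eigenvalues are distinct) also diagonalize B.
_, S = np.linalg.eigh(A)
D = S.T @ B @ S
max_off_diagonal = np.max(np.abs(D - np.diag(np.diag(D))))

print(commute, product_symmetric, max_off_diagonal)
```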


When the eigenvectors of a diagonalizable matrix are orthogonal, we can find orthonormal eigenvectors 


by diagonalization. When this is possible, we call such a matrix orthogonally diagonalizable. 


Definition 9.3 A square matrix $A$ is orthogonally diagonalizable if there exists an orthogonal matrix $Q$ such that $Q^{-1}AQ = Q^\top AQ$ is diagonal.




When an $n \times n$ matrix $A$ is orthogonally diagonalizable, there exists an orthogonal matrix $Q$, i.e. $Q^{-1} = Q^\top$, that diagonalizes $A$ by $Q^{-1}AQ = \Lambda$, where $Q = [v_1 \mid v_2 \mid \cdots \mid v_n]$ and $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_n)$. The $\lambda_i$'s need not be distinct. Recall (3.6) in Corollary 3.1, and we see that we can rewrite an orthogonally diagonalizable matrix $A$ as a sum of rank-one matrices induced from orthonormal vectors, as follows:
\[ A = Q\Lambda Q^\top = \sum_{i=1}^n \lambda_i v_i v_i^\top. \quad (9.6) \]

We can see that such an orthogonally diagonalizable matrix is symmetric. Furthermore, spectral decomposition tells us that a symmetric matrix is orthogonally diagonalizable as well. Combining these two observations, we learn that this property is unique to symmetric matrices.


Theorem 9.2 (The Fundamental Theorem of Symmetric Matrices) A real matrix A is 


orthogonally diagonalizable if and only if A is symmetric. 


Invertibility and Diagonalizability in the Lens of Eigenpairs 


Every n x n matrix has n (including multiple roots) complex eigenvalues. From these eigenvalues, we can 


make the following observations. 
• If all eigenvalues are non-zero, the matrix is invertible;

• If there are $n$ linearly independent eigenvectors, the matrix is diagonalizable. When an eigenvalue $\lambda$ is a multiple root of the characteristic equation $\det(A - xI) = 0$, the characteristic polynomial contains the factor $(x - \lambda)^k$ with $k > 1$, and we say the algebraic multiplicity is $k$. Although we are not proving it here, the number of linearly independent eigenvectors corresponding to each $\lambda$, to which we refer as geometric multiplicity, is at most $k$. The sum of the algebraic multiplicities of all eigenvalues coincides with the number of either rows or columns of the matrix. A matrix is diagonalizable if the sum of the geometric multiplicities matches the number of columns.


An Example of a Non-diagonalizable Matrix 


Let us consider $A = \begin{bmatrix} 1 & 1 \\ 0 & 1 \end{bmatrix}$. As its characteristic polynomial is $p_A(x) = (x - 1)^2$, $A$ has one eigenvalue $\lambda = 1$, and its algebraic multiplicity is 2. The null space of $A - \lambda I$ has dimension 1, since $A - \lambda I = A - I = \begin{bmatrix} 0 & 1 \\ 0 & 0 \end{bmatrix}$ and $2 - \mathrm{rank}(A - I) = 1$. In other words, $A$ has only one linearly independent eigenvector, and the geometric multiplicity of the eigenvalue 1 is 1. Since the geometric multiplicity is smaller than the algebraic multiplicity, this matrix is not diagonalizable.

Let us show that this matrix is not diagonalizable without relying on the relationship between two types of multiplicity. Assume the existence of an invertible matrix $B = \begin{bmatrix} a & b \\ c & d \end{bmatrix}$ that diagonalizes $A$, that is, $B^{-1}AB$ is a diagonal matrix. Because $B^{-1} = \frac{1}{ad - bc}\begin{bmatrix} d & -b \\ -c & a \end{bmatrix}$,
\[ B^{-1}AB = \frac{1}{ad - bc}\begin{bmatrix} ad + cd - bc & d^2 \\ -c^2 & ad - cd - bc \end{bmatrix}. \]
For the latter to be diagonal, both $c$ and $d$ must be 0, in which case $B$ is singular. This contradicts the invertibility assumption, and therefore $A$ is not diagonalizable. Although it is not diagonalizable, we will see later in Section 11.4 that we can express such a non-diagonalizable matrix in a form that is similar to a diagonal matrix, called the Jordan form.
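The two multiplicities of this example can be computed directly, using the rank formula $n - \mathrm{rank}(A - \lambda I)$ for the geometric multiplicity. A minimal sketch assuming NumPy:

```python
import numpy as np

A = np.array([[1.0, 1.0],
              [0.0, 1.0]])
n = A.shape[0]
lam = 1.0

# Algebraic multiplicity: how many times lam appears among the eigenvalues.
algebraic_multiplicity = int(np.sum(np.isclose(np.linalg.eigvals(A), lam)))

# Geometric multiplicity: dimension of Null(A - lam I) = n - rank(A - lam I).
geometric_multiplicity = n - np.linalg.matrix_rank(A - lam * np.eye(n))

# The matrix is diagonalizable only if the two numbers agree for every eigenvalue.
diagonalizable = geometric_multiplicity == algebraic_multiplicity

print(algebraic_multiplicity, geometric_multiplicity, diagonalizable)
```

Counting eigenvalues with `np.isclose` is a convenience that works for this small exact example; for general matrices, repeated eigenvalues are numerically delicate and such a count should be treated with care.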


9.5 The Spectral Decomposition Theorem 


A real n x n matrix A may have complex eigenvalues, even if it contains purely real entries. A Hermitian 
matrix on the other hand only has real eigenvalues, according to Lemma 9.1. Furthermore, a real 
symmetric matrix, which is a special case of a Hermitian matrix, does not only have real eigenvalues 
but also n real eigenvectors that are orthogonal. These results and observations were already implied in 
Theorem 5.2. 


(The Real Spectral Decomposition Theorem revisited) Let $A$ be a real symmetric matrix. Then, $A$ is orthogonally diagonalizable. That is,
\[ A = V\Lambda V^\top = \sum_{i=1}^n \lambda_i v_i v_i^\top, \]
where $V$ is an orthogonal matrix with orthonormal columns $v_1, v_2, \ldots, v_n$, $|v_i| = 1$, and $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_n)$.


See Appendix F for the proof that does not rely on SVD. 
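The rank-one form of the spectral decomposition can be reproduced numerically. The sketch below assumes NumPy; the symmetric matrix is an arbitrary example chosen here.

```python
import numpy as np

A = np.array([[2.0, 1.0, 0.0],
              [1.0, 2.0, 1.0],
              [0.0, 1.0, 2.0]])

# eigh is intended for symmetric (Hermitian) input and returns real eigenvalues
# together with orthonormal eigenvectors as the columns of V.
lams, V = np.linalg.eigh(A)

# Rebuild A as the sum of rank-one matrices lam_i * v_i v_i^T.
A_rebuilt = sum(lams[i] * np.outer(V[:, i], V[:, i]) for i in range(len(lams)))

print(np.round(A_rebuilt, 10))
```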


9.6 How to Compute Eigenvalues and Eigenvectors 


We can find an eigenpair $(\lambda, v)$ of an $n \times n$ square matrix $A$ by solving
\[ Av = \lambda v \]
for $\lambda \in \mathbb{C}$ and $v \in \mathbb{C}^n$. It does not matter whether $A$ consists solely of real values, since its eigenvalues and
eigenvectors may very well contain complex numbers. If we are forced to find eigenvalues and eigenvectors 


without using a computer, we can try the following procedure: 


• Carefully inspect the matrix $A$ to find any clue that allows us to readily compute eigenvalues and their corresponding eigenvectors.

• If there is no such clue, we find eigenvalues by solving the characteristic equation $\det(A - \lambda I) = 0$. Once eigenvalues $\lambda$ are found, we can use Gaussian elimination to find $(n - \mathrm{rank}(A - \lambda I))$ linearly independent vectors that span $\mathrm{Null}(A - \lambda I)$.


9.7. Application to Data Science: Power Iteration and Google PageRank 179 


If $A$ is symmetric, we can narrow down the search space for eigenvalues and eigenvectors to real values only. More specifically, we can use real spectral decomposition (Theorem 5.2) to express the matrix $A$ as the sum of rank-one matrices, as in (5.12), and exploit this structure to identify eigenvalues and eigenvectors. Regardless, there is no single standard approach to solving this eigenvalue problem.

If we are allowed to use computers, we can use any of many numerical methods, such as the power method and the QR method based on QR decomposition. We refer readers to any textbook on numerical analysis.


9.7 Application to Data Science: Power Iteration and Google 


PageRank 


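Let $x_t = (x_t^1, x_t^2, \ldots, x_t^n)^\top \in \mathbb{R}^n$ describe the state of a system at time $t$. This system's state at time $t$ can be expressed as either
\[ x_t = A^t x_0, \quad t = 1, 2, \ldots \]
in the case of a discrete-time system, or
\[ x_t = e^{tA} x_0 = \Big( I + \frac{t}{1!}A + \frac{t^2}{2!}A^2 + \frac{t^3}{3!}A^3 + \cdots \Big) x_0, \quad t \ge 0 \]
in the case of a continuous-time system,⁴ when the system's dynamics is given correspondingly as
\[ x_{t+1} = A x_t, \quad t = 0, 1, 2, \ldots \]
or
\[ \frac{d}{dt} x_t = A x_t, \quad t \ge 0. \]
Either way, we need to compute $A^k$ for $k = 1, 2, \ldots$. If $A$ is diagonalizable, i.e. $A = V\Lambda V^{-1}$, it is conceptually easy to compute this, since
\[ A^k = V\Lambda^k V^{-1}, \]
and the matrix exponential is similarly simplified to
\[ e^{tA} = V\Big( I + \frac{t}{1!}\Lambda + \frac{t^2}{2!}\Lambda^2 + \frac{t^3}{3!}\Lambda^3 + \cdots \Big) V^{-1} = V \mathrm{diag}(e^{t\lambda_1}, e^{t\lambda_2}, \ldots, e^{t\lambda_n}) V^{-1}. \]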
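Both formulas can be compared against direct computation. The sketch below assumes NumPy; the matrix `A` (with eigenvalues $-1$ and $-2$), the power `k`, and the time `t` are arbitrary choices made here. The reference value for $e^{tA}$ is a partial sum of the defining series rather than a library routine.

```python
import numpy as np

A = np.array([[0.0, 1.0],
              [-2.0, -3.0]])          # eigenvalues are -1 and -2
lams, V = np.linalg.eig(A)
V_inv = np.linalg.inv(V)

# A^k = V Lambda^k V^{-1}
k = 5
A_power = V @ np.diag(lams ** k) @ V_inv

# e^{tA} = V diag(e^{t lam_1}, ..., e^{t lam_n}) V^{-1}
t = 0.5
exp_tA = V @ np.diag(np.exp(t * lams)) @ V_inv

# Reference: partial sum I + tA/1! + (tA)^2/2! + ... of the series definition.
series = np.eye(2)
term = np.eye(2)
for j in range(1, 30):
    term = term @ (t * A) / j        # term now equals (tA)^j / j!
    series = series + term

print(np.round(A_power, 8))
print(np.round(exp_tA, 8))
```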


Even when A is diagonalizable, it may not be feasible to compute eigenpairs if the matrix is large. 
Here, we consider approximately computing the eigenpair with the largest absolute eigenvalue, which is 
the core idea behind Google’s PageRank. In Google’s PageRank, web pages are ranked based on the 


entries of the eigenvector associated with the eigenvalue of the largest magnitude. 


⁴ We refer to $e^B$ as the matrix exponential of a square matrix $B$ and define it as
\[ e^B = I + \frac{1}{1!}B + \frac{1}{2!}B^2 + \frac{1}{3!}B^3 + \cdots = \sum_{k=0}^{\infty} \frac{1}{k!}B^k. \]




Let $A = V\Lambda V^{-1}$, $V = [v_1 \mid v_2 \mid \cdots \mid v_n]$ be a diagonalizable matrix, and further assume that $|v_i| = 1$ for all $i$ and $|\lambda_1| > |\lambda_2| \ge |\lambda_3| \ge \cdots$. First, sample a vector $x$ from a Gaussian distribution. With probability 1, $w_1 \neq 0$ where $w = V^{-1}x$, because $x$ follows a Gaussian distribution.⁵ Let $w_1 > 0$ (if necessary, we can multiply $w$ with $-1$). As we have seen already, $A^k V = V\Lambda^k$, which allows us to get
\[ A^k x = A^k V w = V\Lambda^k w = \sum_{i=1}^n w_i \lambda_i^k v_i. \]
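If we continue to rearrange terms to be explicit about $\lambda_1$ and $v_1$,
\[ A^k x = w_1 \lambda_1^k v_1 + \sum_{i=2}^n w_i \lambda_i^k v_i = w_1 \lambda_1^k \Big( v_1 + \sum_{i=2}^n \frac{w_i}{w_1} \Big( \frac{\lambda_i}{\lambda_1} \Big)^k v_i \Big). \]
If we use $z_k$ to denote the summation term inside the parentheses on the right hand side,
\[ |z_k| \le \sum_{i=2}^n \Big| \frac{w_i}{w_1} \Big| \Big| \frac{\lambda_i}{\lambda_1} \Big|^k |v_i| \le \Big| \frac{\lambda_2}{\lambda_1} \Big|^k \frac{\sum_{i=2}^n |w_i|}{|w_1|}. \]
As $\frac{1}{|w_1|}\sum_{i=2}^n |w_i| \ge 0$ does not depend on $k$ and $\rho = \big| \frac{\lambda_2}{\lambda_1} \big| < 1$, it is guaranteed that $\lim_{k \to \infty} z_k = 0$, although the convergence can slow down with a large $\rho$.⁶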


Now we design a practical algorithm to estimate the eigenvector of the largest eigenvalue. Because
\[ \frac{A^k x}{|A^k x|} = \frac{w_1 \lambda_1^k (v_1 + z_k)}{w_1 \lambda_1^k |v_1 + z_k|} = \frac{v_1 + z_k}{|v_1 + z_k|}, \]
in the limit of $k \to \infty$, this quantity converges to the correct vector, as follows:
\[ \lim_{k \to \infty} \frac{A^k x}{|A^k x|} = \lim_{k \to \infty} \frac{v_1 + z_k}{|v_1 + z_k|} = v_1. \]


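The algorithm then can be described by the following three steps:
1. $x_0 = x \sim \mathcal{N}(0, I)$;
2. $y_{k+1} = A x_k$, $x_{k+1} = \frac{y_{k+1}}{|y_{k+1}|}$;
3. $\alpha_{k+1} = x_{k+1}^\top A x_{k+1}$.

By repeating the second step, we get $x_k \to v_1$. In the case of the third step, $\lim_{k \to \infty} \alpha_k = \lambda_1$, because
\[ \alpha_k = \frac{(v_1 + z_k)^\top A (v_1 + z_k)}{|v_1 + z_k|^2} = \frac{(v_1 + z_k)^\top (\lambda_1 v_1 + A z_k)}{|v_1 + z_k|^2} = \frac{\lambda_1 + v_1^\top A z_k + \lambda_1 z_k^\top v_1 + z_k^\top A z_k}{|v_1 + z_k|^2}. \]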


⁵ In fact, it is fine to use any continuous distribution to sample $x$.
⁶ Later in Example 10.5, we explain how we can add a rank-one matrix to make $\rho < 0.85$ to avoid slow convergence.




By repeating these two steps, the algorithm converges to the eigenpair with the largest eigenvalue. This 
algorithm is computationally efficient, as each iteration requires matrix-vector multiplication only rather 


than matrix-matrix multiplication. 
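The three steps above can be sketched in a few lines. This is an illustrative implementation assuming NumPy, not the exact procedure used by any search engine; the function name `power_iteration`, the iteration count, the seed, and the test matrix are all choices made here. The example matrix is symmetric with a positive dominant eigenvalue $(5 + \sqrt{5})/2$, so the iterates do not flip sign.

```python
import numpy as np

def power_iteration(A, num_iters=200, seed=0):
    """Estimate the eigenpair of A with the largest absolute eigenvalue."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(A.shape[0])   # step 1: Gaussian starting vector
    for _ in range(num_iters):
        y = A @ x                          # step 2: one matrix-vector product ...
        x = y / np.linalg.norm(y)          # ... followed by normalization
    alpha = x @ A @ x                      # step 3: eigenvalue estimate x^T A x
    return alpha, x

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
alpha, x = power_iteration(A)

print(alpha, x)
```

Each iteration costs one matrix-vector product, which is what makes the method attractive for large sparse matrices. The number of iterations needed depends on the ratio $\rho$ discussed above.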




Chapter 10 


Advanced Results in Linear Algebra 


We discuss some remaining results on linear algebra and matrix theory. An exploration of a dual space 
of a vector space provides many insights into the original vector space. The dual space of a finite- 
dimensional vector space is characterized very well, but the characterization of the dual space for an 
infinite-dimensional vector space is tricky and covered in a functional analysis course. The transpose is a 
popular operation on matrices, but it is difficult to relate it to some aspect of the linear transformation 
corresponding to the matrix. It can be done implicitly by the adjoint of the linear transformation in 
a vector space equipped with an inner product. This implicit description of adjoints explains why we 
could not find an analogy for a transpose in a direct set-up. Through this adjoint, we can explain why 
a projection matrix should be symmetric. As an important result for probability theory, the Perron-Frobenius theorem says that a matrix with only positive entries must have a positive eigenvalue and an eigenvector with only positive elements. We also discuss the Schur triangularization, which says that we can find a unitary basis through which an arbitrary matrix is similar to an upper triangular matrix whose diagonal entries are eigenvalues.


10.1 A Dual Space 


We call a linear map from a vector space V to a real R or complex C space a linear functional. That is, 
a linear functional is a scalar-valued linear function. We then refer to a set of all linear functionals on V, 
that is, 

V* = {f : f is a linear functional on V} 


as a dual space. This dual space V* is also a vector space, as the sum or the scalar multiple of a linear 
functional on V is also a linear functional. Then, we want to know the dimension of this vector space V* 
when dim V = n. 

We can think of R as a one-dimensional vector space and use 1 as its basic vector. Then, a linear 


functional $f : V \to \mathbb{R}$ is a linear map from an $n$-dimensional vector space to a one-dimensional vector


Kang and Cho, Linear Algebra for Data Science, 183 
©2024. (Wanmo Kang, Kyunghyun Cho) all rights reserved. 


184 Chapter 10. Advanced Results in Linear Algebra 


space. When $\mathcal{B}_V$ is a basis of $V$, there exists a $1 \times n$ matrix $A$ such that $f(v) = Ax$, where $x \in \mathbb{R}^n$ is the coordinate of $v \in V$ with respect to $\mathcal{B}_V$. When $a \in \mathbb{R}^n$ is the only row vector of $A$, i.e. $A = [a^\top]$, $f(v) = a^\top x$. In other words, we can view a linear functional in an $n$-dimensional vector space as an $n$-dimensional Euclidean vector.

Already in Section 3.8.1, we showed that $a_i = f(v_i)$ where $\mathcal{B}_V = \{v_1, \ldots, v_n\}$. Using this fact, we now show that the correspondence between $V^*$ and $\mathbb{R}^n$ is injective. If two linear functionals, $f$ and $g$, are different, there has to be at least one basic vector $v_j$ for which $f(v_j) \neq g(v_j)$. Then, $a_j \neq b_j$ if we use $a$ and $b$ to denote the vectors corresponding to functionals $f$ and $g$, respectively. That is, two vectors in $\mathbb{R}^n$ corresponding to two different linear functionals in $V^*$ differ. Therefore, this correspondence is injective.


Furthermore, if we define $h(v) = a^\top x$ given an arbitrary $a \in \mathbb{R}^n$, i.e.,
\[ h\Big( \sum_{i=1}^n x_i v_i \Big) = \sum_{i=1}^n a_i x_i, \]
it is clear that $h$ is a linear map from $V$ to $\mathbb{R}$, that is, $h \in V^*$, which implies that this relationship is also surjective. The correspondence between a linear functional in $V^*$ and an $\mathbb{R}^n$-vector is bijective.

Let us use $f_j \in V^*$ to denote a linear functional corresponding to $a = e_j \in \mathbb{R}^n$. That is, $f_j\big( \sum_{i=1}^n x_i v_i \big) = e_j^\top x = x_j$ for $v = \sum_{i=1}^n x_i v_i$. To check the linear independence of $f_j$'s, assume that $\sum_{j=1}^n \alpha_j f_j = 0$, a functional equality, holds for some $\alpha_j$'s. Then, it holds that $\big( \sum_{j=1}^n \alpha_j f_j \big)(v) = \sum_{j=1}^n \alpha_j f_j(v) = 0$ for any $v \in V$. This is equivalent to $\sum_{j=1}^n \alpha_j f_j\big( \sum_{i=1}^n x_i v_i \big) = \sum_{j=1}^n \alpha_j x_j = 0$ for any $x = (x_1, \ldots, x_n)^\top \in \mathbb{R}^n$. Therefore, it must be that $\alpha_1 = \cdots = \alpha_n = 0$. In other words, $\{f_1, \ldots, f_n\} \subset V^*$ is linearly independent. $V^* = \mathrm{span}\{f_1, \ldots, f_n\}$ is easy to see from the surjectivity of the correspondence. So $\{f_1, \ldots, f_n\}$ is a basis of the dual space $V^*$, to which we refer as a dual basis, and therefore $\dim V^* = n$. This allows us to treat $V^*$ and $\mathbb{R}^n$ as if they were the same. We can furthermore consider the dual space of $V^*$, that is, $(V^*)^* = V^{**}$, since $V^*$ is itself a vector space.
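For a concrete picture of the dual basis, take $V = \mathbb{R}^3$ with a non-standard basis stored as the columns of a matrix. This specialization and all names below are assumptions made for illustration. Since $f_j$ returns the $j$-th coordinate with respect to the basis, it is represented (in standard coordinates) by the $j$-th row of the inverse of the basis matrix. A sketch assuming NumPy:

```python
import numpy as np

# Columns of V_basis are the basic vectors v_1, v_2, v_3 (an arbitrary basis of R^3).
V_basis = np.array([[1.0, 1.0, 0.0],
                    [0.0, 1.0, 1.0],
                    [1.0, 0.0, 1.0]])

# Row j of F represents the dual functional f_j: f_j(v) = F[j] @ v is the
# j-th coordinate of v with respect to the basis.
F = np.linalg.inv(V_basis)

# f_j(v_i) is 1 when i = j and 0 otherwise.
pairing = F @ V_basis

# Any functional f(v) = c @ v corresponds to the vector a with a_i = f(v_i),
# and it is recovered as the combination sum_j a_j f_j.
c = np.array([1.0, -2.0, 0.5])
a = c @ V_basis
c_rebuilt = a @ F

print(np.round(pairing, 10), a)
```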

A linearly constrained problem in optimization corresponds to finding a vector that maximizes or 
minimizes an objective function while satisfying a set of equalities and inequalities defined using linear 
functionals. We define a new dual variable for each linear functional in the original (primal) problem 


and re-define the optimization problem, which we call a dual problem. This serves an important role in 


optimization. 


10.2 Transpose of Matrices and Adjoint of Linear Transforma- 
tions 
Take two vector spaces, $V$ and $W$, with inner products defined on them. We use $\langle \cdot, \cdot \rangle_V$ and $\langle \cdot, \cdot \rangle_W$ to refer to them, respectively. We then define the adjoint of a linear transformation, which can be thought of as a function version of matrix transpose.


10.2. Transpose of Matrices and Adjoint of Linear Transformations 185 


Definition 10.1 A function $f : W \to V$ is an adjoint of a linear transformation $T : V \to W$ if
\[ \langle T(v), w \rangle_W = \langle v, f(w) \rangle_V \quad (10.1) \]
for all $v \in V$ and $w \in W$.


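We can show the uniqueness of adjoint using the following result derived from the inner product's property.
Fact 10.1 Let $f$ and $g$ be functions from $W$ to $V$. If $\langle v, f(w) \rangle_V = \langle v, g(w) \rangle_V$ for all $v \in V$ and all $w \in W$, then $f = g$.

Proof: For $v^* = f(w) - g(w) \in V$,
\[ 0 = \langle v^*, f(w) \rangle_V - \langle v^*, g(w) \rangle_V = \langle v^*, f(w) - g(w) \rangle_V = \langle v^*, v^* \rangle_V \]
implies $f(w) - g(w) = 0$. □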
Using this result, we can show the uniqueness of adjoint, as shown below in Fact 10.2. It is however important to remember that the adjoint of a linear transformation may not exist in an infinite-dimensional vector space, as shown in Example 10.2.
Fact 10.2 If a linear transformation has an adjoint, then it is unique.

Proof: If $f$ and $g$ are adjoints of a linear transformation $T$, $\langle v, f(w) \rangle_V = \langle v, g(w) \rangle_V$ for all $v \in V$ and $w \in W$ according to (10.1). Then, $f = g$ by the result in Fact 10.1. □

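When the adjoint of a linear transformation $T$ exists, we use $T^*$ to refer to it. When $T^* = T$, we say $T$ is self-adjoint. $T^*$ is also a linear transformation, as shown below.
Fact 10.3 If a linear transformation has an adjoint, then it is also linear.
Proof: Assume $T^*$ exists. Then, for any $v \in V$, $w_1, w_2 \in W$, and scalar $\alpha$,
\[
\begin{aligned}
\langle v, T^*(\alpha w_1 + w_2) \rangle_V &= \langle T(v), \alpha w_1 + w_2 \rangle_W \\
&= \langle T(v), \alpha w_1 \rangle_W + \langle T(v), w_2 \rangle_W \\
&= \alpha \langle T(v), w_1 \rangle_W + \langle T(v), w_2 \rangle_W \\
&= \alpha \langle v, T^*(w_1) \rangle_V + \langle v, T^*(w_2) \rangle_V \\
&= \langle v, \alpha T^*(w_1) \rangle_V + \langle v, T^*(w_2) \rangle_V \\
&= \langle v, \alpha T^*(w_1) + T^*(w_2) \rangle_V.
\end{aligned}
\]
Since this holds for any $v \in V$, $T^*(\alpha w_1 + w_2) = \alpha T^*(w_1) + T^*(w_2)$ due to the result from Fact 10.1. Therefore, $T^*$ is a linear transformation. □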




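Example 10.1 Consider a vector space consisting of polynomials, that is, $P = \{a_0 + a_1 x + \cdots + a_n x^n : a_i \in \mathbb{R} \text{ for } 0 \le i \le n,\ n = 0, 1, 2, \ldots\}$. We define an inner product as
\[ \langle f, g \rangle_P = \int_0^1 f(x) g(x)\, dx. \]
Let us find the adjoint of a linear transformation $T_p(f) = pf$ given a fixed $p \in P$. Since
\[
\begin{aligned}
\langle T_p(f), g \rangle_P &= \langle pf, g \rangle_P \\
&= \int_0^1 (p(x) f(x)) g(x)\, dx \\
&= \int_0^1 f(x) (p(x) g(x))\, dx \\
&= \langle f, pg \rangle_P = \langle f, T_p(g) \rangle_P,
\end{aligned}
\]
$T_p^* = T_p$, which implies that $T_p$ is self-adjoint.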


Example 10.2 [Non-existence of Adjoint] Consider the vector space $P$ from Example 10.1 and a linear transformation $T(f) = f'$, where $f'$ is the derivative of $f$. Assume the adjoint $T^*$ exists. Let $h = T^*(g) \in P$, where $g(x) = 1$ is a constant function in $P$. From the basic result in calculus, we get
\[ \langle T(f), g \rangle_P = \int_0^1 f'(x)\, dx = f(1) - f(0), \]
and for any $f \in P$, it must hold that
\[ \int_0^1 f'(x)\, dx = f(1) - f(0) = \int_0^1 f(x) h(x)\, dx. \]
If $f(x) = x^2 (x - 1)^2 h(x)$ such that $f(1) = f(0) = 0$, then
\[ 0 = \int_0^1 x^2 (x - 1)^2 h(x)^2\, dx = \int_0^1 \big( x (x - 1) h(x) \big)^2\, dx. \]
In this case, for $q(x) = x(x - 1)h(x) \in P$, $q = 0$ since $\langle q, q \rangle_P = 0$, and hence $T^*(g) = h = 0$.¹ Hence, for any $f \in P$,
\[ f(1) - f(0) = \langle T(f), g \rangle_P = \langle f, T^*(g) \rangle_P = \langle f, 0 \rangle_P = 0, \]
which is equivalent to $f(1) = f(0)$. We can however find a contradictory example of $f$, such as $f(x) = x \in P$. Therefore, the linear transformation that corresponds to differentiation does not have an adjoint.


We now consider the relationship between the adjoint and matrix transpose. 


¹ If $h(x)$ takes a non-zero value on any interval of positive length within $[0, 1]$, the integral above cannot be 0, and since $h(x)$ is a polynomial, it must be that $h(x) = 0$.




Fact 10.4 Consider vector spaces $V = \mathbb{R}^n$ and $W = \mathbb{R}^m$ with the standard inner products. Let $A$ be an $m \times n$ matrix. Then, the adjoint of $T(v) = Av$ is $T^*(w) = A^\top w$.

Proof: Because
\[ \langle T(v), w \rangle_W = (Av)^\top w = v^\top A^\top w = \langle v, A^\top w \rangle_V, \]
$T^*(w) = A^\top w$ is an adjoint of $T(v)$. By Fact 10.2, this is the adjoint of $T$. □
Because of this fact, we often use transpose and adjoint interchangeably. Furthermore, in a finite- 
dimensional space, the matrix representing a self-adjoint operator is symmetric, and thereby we use 
self-adjoint and symmetric interchangeably as well. 
Finally, in a finite-dimensional vector space, any linear transformation can be represented as a finite- 
size matrix. The adjoint of such a linear transformation is then represented by the transpose of this 
matrix, according to Fact 10.4. That is, the adjoint of a linear transformation between finite-dimensional spaces always exists.
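The defining identity (10.1) for $T(v) = Av$ and $T^*(w) = A^\top w$ is easy to confirm on random data. A minimal sketch assuming NumPy; the sizes and the seed are arbitrary choices made here.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 3, 4
A = rng.standard_normal((m, n))   # T maps R^n to R^m
v = rng.standard_normal(n)
w = rng.standard_normal(m)

lhs = (A @ v) @ w                 # <T(v), w> in R^m
rhs = v @ (A.T @ w)               # <v, T*(w)> in R^n

print(lhs, rhs)
```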


10.2.1 Adjoint and Projection 


We learned in Fact 4.16 that an orthogonal projection matrix P is symmetric and satisfies P? = P. We 
call a matrix or a function idempotent when the square of the matrix or the composition of the function 
with itself are the same as the original one. It is natural to see that the projection matrix is idempotent, 
since a vector should not change once it has been projected onto the target subspace. Then, how does 
the symmetry of P relate to the fact that it is projection? 

Although we imply orthogonality when we say projection, projection does not have to be orthogonal. 
We can very well project a vector in a 2-dimensional space to a one-dimensional subspace along the 
45°-tilted line. We can then ask what is the right way to express orthogonal projection more explicitly. 
Let $P(\cdot)$ be a linear transformation that corresponds to the orthogonal projection onto the subspace $W$ from the original vector space $V$ in which an inner product $\langle \cdot, \cdot \rangle$ is defined. For any $v, w \in V$, we get
\[ 0 = \langle v - P(v), P(w) \rangle = \langle v, P(w) \rangle - \langle P(v), P(w) \rangle, \]
because $v - P(v) \perp W$ and $P(w) \in W$. If we swap the roles of $v$ and $w$, we also get $\langle P(v), w \rangle = \langle P(v), P(w) \rangle$. Combining these two together, we find
\[ \langle P(v), w \rangle = \langle P(v), P(w) \rangle = \langle v, P(w) \rangle, \]
which implies that orthogonal projection as a linear transformation is self-adjoint. The orthogonal projection matrix is symmetric since the matrix associated with a self-adjoint transformation in finite-dimensional space is symmetric due to Fact 10.4.

Conversely, let an idempotent $P$ be self-adjoint. In other words, for any $v, w \in V$, it holds that $\langle P(v), w \rangle = \langle v, P(w) \rangle$. If we replace $v$ with $P(v)$ and $w$ with $v$, we end up with $\langle P(P(v)), v \rangle = \langle P(v), P(v) \rangle$. Since $P$ is idempotent, this simplifies to $\langle P(v), v \rangle = \langle P(v), P(v) \rangle$. By rearranging the terms, we arrive at
\[ \langle P(v), v - P(v) \rangle = 0, \]
and $P$ is an orthogonal projection.


We summarize these observations into the following lemma. 


Lemma 10.1 An idempotent linear transformation represents a projection. The projection is orthogonal 


if and only if the transformation is self-adjoint. 
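Lemma 10.1 can be seen on two projections of $\mathbb{R}^2$ onto the first coordinate axis: the orthogonal one, and an oblique one that projects along a 45°-tilted line. The sketch assumes NumPy; the matrices and the test vector are illustrative choices made here.

```python
import numpy as np

# Orthogonal projection onto span{(1, 0)}.
P_orth = np.array([[1.0, 0.0],
                   [0.0, 0.0]])
# Oblique projection onto the same line, along the direction (1, -1).
P_obl = np.array([[1.0, 1.0],
                  [0.0, 0.0]])

v = np.array([2.0, 3.0])

# Both matrices are idempotent, so both are projections.
idempotent_orth = np.allclose(P_orth @ P_orth, P_orth)
idempotent_obl = np.allclose(P_obl @ P_obl, P_obl)

# Only the orthogonal projection is symmetric (self-adjoint).
symmetric_orth = np.allclose(P_orth, P_orth.T)
symmetric_obl = np.allclose(P_obl, P_obl.T)

# Inner product between the projection P v and the residual v - P v.
gap_orth = (P_orth @ v) @ (v - P_orth @ v)
gap_obl = (P_obl @ v) @ (v - P_obl @ v)

print(symmetric_orth, gap_orth, symmetric_obl, gap_obl)
```

For the oblique projection, the residual $v - P(v) = (-3, 3)$ is not orthogonal to $P(v) = (5, 0)$, which matches the failure of symmetry.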


10.3 Further Results on Positive Definite Matrices 


Despite some of the restrictions, such as the lack of commutativity in multiplication, square matrices are 
similar to real numbers, as addition and multiplication are both well defined for them. Among square 
matrices, positive definite matrices correspond to positive real numbers and share with them various 
properties. When we model a complex system with many variables, some of the variables are often 
expressed as square matrices. If they were positive definite matrices, we can use numerous results on 
them to conveniently analyze such a system. 


Before we continue, we list symbols that are handy when dealing with positive definite matrices:
• $\mathbb{R}^{m \times n}$: a set of $m \times n$ matrices with real entries;
• $\mathbb{S}^n$: a set of $n \times n$ symmetric matrices;
• $\mathbb{S}^n_{+}$: a set of $n \times n$ positive semi-definite matrices. $A \succeq 0$ if $A \in \mathbb{S}^n_{+}$;
• $\mathbb{S}^n_{++}$: a set of $n \times n$ positive definite matrices. $A \succ 0$ if $A \in \mathbb{S}^n_{++}$;
• $A \succ B$ (resp., $A \succeq B$) if and only if $A - B \succ 0$ (resp., $A - B \succeq 0$).


10.3.1 Congruence Transformations 


The notion of similarity in Definition 9.1 was about the same linear transformation expressed in two 
different bases; if transformations are the same, the corresponding matrices are similar. On the other 
hand, here we define the notion of congruence as the preservation of positive (semi-)definiteness of a 


coefficient matrix of a quadratic form after linearly transforming the variables. 
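Theorem 10.1 Let $A \in \mathbb{S}^n$ and $B \in \mathbb{R}^{n \times m}$, and consider the product
\[ C = B^\top A B \in \mathbb{S}^m. \quad (10.2) \]
1. If $A \succeq 0$, then $C \succeq 0$;

2. If $A \succ 0$, then $C \succ 0$ if and only if $\mathrm{rank}\, B = m$;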


10.3. Further Results on Positive Definite Matrices 189 


3. If $B$ is square and invertible, then $A \succ 0$ (resp. $A \succeq 0$) if and only if $C \succ 0$ (resp. $C \succeq 0$).
Proof: For $x \in \mathbb{R}^m$, set $y = Bx \in \mathbb{R}^n$.
1. $x^\top C x = y^\top A y \ge 0$ since $A \succeq 0$;

2. If $\mathrm{rank}\, B = m$, $y = 0 \Leftrightarrow x = 0$. Then $x^\top C x = y^\top A y > 0$ if $x \neq 0$. That is, $C \succ 0$. Conversely, if $C \succ 0$, then $x = 0 \Leftrightarrow y = 0$, which implies $B$ has full-column rank, i.e., $\mathrm{rank}\, B = m$;

3. If $B$ is square and invertible, $\mathrm{rank}\, B = m$. By applying 1 and 2 to both (10.2) and $A = B^{-\top} C B^{-1}$, we obtain the results.

□
We call (10.2) a congruence transformation. Compare this congruence transformation with the
similarity transformation in Definition 9.1.
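Corollary 10.1 For any matrix $B \in \mathbb{R}^{m \times n}$, it holds that:
1. $B^\top B \succeq 0$ and $B B^\top \succeq 0$;
2. $B^\top B \succ 0$ if and only if $B$ has full column rank, i.e., $\operatorname{rank} B = n$;
3. $B B^\top \succ 0$ if and only if $B$ has full row rank, i.e., $\operatorname{rank} B = m$.

Proof: By setting $A = I$ in Theorem 10.1 (applied to $B$ for $B^\top B$ and to $B^\top$ for $B B^\top$), we obtain all results. ∎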
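The statements of Theorem 10.1 and Corollary 10.1 can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the helper names `is_psd` and `is_pd`, the tolerance, and the random test matrices are illustrative choices, not part of the text.

```python
import numpy as np

def is_psd(M, tol=1e-10):
    """True if the symmetric matrix M has no eigenvalue below -tol."""
    return bool(np.all(np.linalg.eigvalsh(M) >= -tol))

def is_pd(M, tol=1e-10):
    """True if every eigenvalue of the symmetric matrix M exceeds tol."""
    return bool(np.all(np.linalg.eigvalsh(M) > tol))

rng = np.random.default_rng(0)

# A positive definite 4 x 4 matrix A, built as G^T G + I.
G = rng.standard_normal((4, 4))
A = G.T @ G + np.eye(4)

# B_full has full column rank (4 x 2); B_def repeats a column, so its rank is 1.
B_full = rng.standard_normal((4, 2))
B_def = np.column_stack([B_full[:, 0], B_full[:, 0]])

C_full = B_full.T @ A @ B_full   # congruence with a full-column-rank B
C_def = B_def.T @ A @ B_def      # congruence with a rank-deficient B

print(is_pd(C_full), is_psd(C_def), is_pd(C_def))
print(is_pd(B_full.T @ B_full), is_psd(B_full @ B_full.T), is_pd(B_full @ B_full.T))
```

With a full-column-rank `B_full` the product stays positive definite, while the rank-deficient `B_def` only preserves positive semi-definiteness, in line with the second item of the theorem.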
The complexity of a mathematical model involving a matrix can often be lowered if the matrix is
diagonalizable. Below, we present important results on two simultaneously diagonalizable matrices and
on the diagonalizable product of two matrices.


Theorem 10.2 (Joint diagonalization by similarity transformation) Let $A_1, A_2 \in \mathbb{S}^n$ and
$$A = a_1 A_1 + a_2 A_2 \succ 0$$
for some scalars $a_1$ and $a_2$. Then, there exists a non-singular matrix $B \in \mathbb{R}^{n \times n}$ such that $B^\top A_1 B$
and $B^\top A_2 B$ are diagonal.


Proof: If both $a_1$ and $a_2$ vanish, $A \not\succ 0$. Hence, at least one of them should be non-zero, and we
assume that $a_2 \ne 0$. Since $A \succ 0$, $A = C^\top C$ for some invertible matrix $C$. If we plug it into the original
equation, we get $C^\top C = a_1 A_1 + a_2 A_2$. We re-write it as
$$I_n = a_1 C^{-\top} A_1 C^{-1} + a_2 C^{-\top} A_2 C^{-1}.$$
Since $C^{-\top} A_1 C^{-1}$ is still symmetric, symmetric spectral decomposition guarantees $C^{-\top} A_1 C^{-1} = Q \Lambda Q^\top$,
where $Q$ is orthogonal and $\Lambda$ is a diagonal matrix with eigenvalues as its diagonal entries. We multiply
by $Q^\top$ on the left and $Q$ on the right on both sides of the equation and obtain
$$I_n = Q^\top I_n Q = a_1 Q^\top C^{-\top} A_1 C^{-1} Q + a_2 Q^\top C^{-\top} A_2 C^{-1} Q = a_1 \Lambda + a_2 Q^\top C^{-\top} A_2 C^{-1} Q.$$
Since $a_2 \ne 0$, we can modify the above equation as
$$\frac{1}{a_2}\left(I_n - a_1 \Lambda\right) = Q^\top C^{-\top} A_2 C^{-1} Q,$$
where the right side is diagonal because the left side is diagonal. Hence, if we set $B = C^{-1} Q$, then $B$ is
invertible and diagonalizes both $A_1$ and $A_2$. ∎


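Corollary 10.2 Let $A \succ 0$ and $C \in \mathbb{S}^n$. Then, there exists an invertible matrix $B$ such that $B^\top C B$ is
diagonal and $B^\top A B = I_n$.

Proof: If we apply Theorem 10.2 to $A = 1 \cdot A + 0 \cdot C \succ 0$, then there exists an invertible $\hat{B}$ such
that both $\hat{B}^\top A \hat{B}$ and $\hat{B}^\top C \hat{B}$ are diagonal. By the third property from Theorem 10.1, $\hat{B}^\top A \hat{B} \succ 0$.
Since $\hat{B}^\top A \hat{B} = \operatorname{diag}(\lambda_i)$, $\lambda_i > 0$ for all $i$. Let $D = \operatorname{diag}(\sqrt{\lambda_i})$, and set $B = \hat{B} D^{-1}$. Then, $B^\top A B =
D^{-1} \hat{B}^\top A \hat{B} D^{-1} = D^{-1} D^2 D^{-1} = I_n$, and $B^\top C B = D^{-1} \hat{B}^\top C \hat{B} D^{-1}$ is diagonal, since both $\hat{B}^\top C \hat{B}$ and
$D^{-1}$ are diagonal. ∎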
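The construction in Theorem 10.2 and Corollary 10.2 can be sketched in code. This is one possible realization under stated assumptions: it uses NumPy, takes the Cholesky factor of $A$ as the invertible factor (so the proof's $C$ with $A = C^\top C$ is `L.T` here), and the function name `simultaneous_diagonalize` is hypothetical.

```python
import numpy as np

def simultaneous_diagonalize(A, C):
    """Return (B, d) with B.T @ A @ B = I and B.T @ C @ B = diag(d).

    Assumes A is symmetric positive definite and C is symmetric.
    """
    L = np.linalg.cholesky(A)          # A = L L^T with L lower triangular
    L_inv = np.linalg.inv(L)
    W = L_inv @ C @ L_inv.T            # symmetric matrix L^{-1} C L^{-T}
    W = (W + W.T) / 2.0                # remove rounding asymmetry before eigh
    d, Q = np.linalg.eigh(W)           # W = Q diag(d) Q^T with Q orthogonal
    B = L_inv.T @ Q                    # plays the role of C^{-1} Q in the proof
    return B, d

rng = np.random.default_rng(0)
G = rng.standard_normal((4, 4))
A = G.T @ G + np.eye(4)                # positive definite
H = rng.standard_normal((4, 4))
C = H + H.T                            # symmetric, possibly indefinite

B, d = simultaneous_diagonalize(A, C)
print(np.round(B.T @ A @ B, 8))
print(np.round(B.T @ C @ B, 8))
```

Because `B.T @ A @ B` is the identity, no extra rescaling by $D^{-1}$ is needed in this variant; the Cholesky factor already normalizes $A$.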


Corollary 10.3 Let $A, B \in \mathbb{S}^n$ with $A \succ 0$. Then, the matrix $AB$ is diagonalizable and has real
eigenvalues only.

Proof: Let $A^{1/2}$ be the square root of the positive definite matrix $A$. Then,
$$A^{-1/2} (AB) A^{1/2} = A^{1/2} B A^{1/2}.$$
The matrix on the right side is symmetric, diagonalizable, and has real eigenvalues. Since $AB$ and
$A^{1/2} B A^{1/2}$ are similar (as in Definition 9.1), both share the same eigenvalues as well as their diagonalizability.

10.3.2 A Positive Semi-definite Cone and Partial Order 


Convexity is an important structure we use to investigate and describe mathematical objects.² It
often makes analysis and optimization easier and more intuitive to approach when we understand the
convexity of target objects. In this section, we are particularly interested in the convexity of a cone,
where a cone is defined as a set of half-infinite rays, such as the set of all positive real numbers and the
first quadrant of a real plane.


Definition 10.2 Let $V$ be a vector space. A subset $K$ of $V$ is a cone if $\lambda v \in K$ for all $v \in K$ and all $\lambda > 0$. A subset of $V$ is a convex cone if it is convex and a cone.


In the context of matrices, positive (semi-)definite matrices form a convex cone. It is intuitive to draw
this conclusion by noticing the similarity between the positive definiteness of matrices and positivity of
real values.

² If you are not familiar with convexity, refer to Appendix A.




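Fact 10.5 $\mathbb{S}^n_{+}$ and $\mathbb{S}^n_{++}$ are convex cones.

Proof: It is clear that $\mathbb{S}^n_{+}$ is a cone. For $A, B \in \mathbb{S}^n_{+}$ and $0 \le \lambda \le 1$, $x^\top(\lambda A + (1-\lambda) B)x = \lambda x^\top A x + (1 - \lambda) x^\top B x \ge 0$,
since $x^\top A x \ge 0$ and $x^\top B x \ge 0$. Hence, $\lambda A + (1-\lambda) B \in \mathbb{S}^n_{+}$. The proof that $\mathbb{S}^n_{++}$ is a convex
cone is parallel. ∎

A major difference between either $\mathbb{S}^n_{+}$ or $\mathbb{S}^n_{++}$ and positive real values is whether we can tell one is
greater than the other given a pair of elements. In the former case of matrices, we often cannot tell this.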


Example 10.3 Can we impose an order on positive semi-definite matrices in $\mathbb{S}^n_{+}$? Let us try to borrow
the relation introduced earlier, $A \succeq B$ if and only if $A - B \succeq 0$. Indeed, transitivity
holds, as $A \succeq C$ if $A \succeq B$ and $B \succeq C$. Consider however the following matrices:
$$A = \begin{bmatrix} 2 & 1 \\ 1 & 1 \end{bmatrix}, \quad B = \begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix}, \quad C = \begin{bmatrix} 1 & 1 \\ 1 & 2 \end{bmatrix}.$$
It holds that $A \succeq B$ and $C \succeq B$, but neither $A \succeq C$ nor $C \succeq A$ holds between
$A$ and $C$. In other words, there may not be an order defined using $\succeq$ between two positive semi-definite
matrices. Therefore, $\succeq$ defines only a partial order on $\mathbb{S}^n$.
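The three matrices of Example 10.3 can be compared numerically. This is a small sketch assuming NumPy; the helper name `loewner_geq` and its tolerance are illustrative.

```python
import numpy as np

def loewner_geq(X, Y, tol=1e-12):
    """True if X - Y is positive semi-definite, i.e., X >= Y in the partial order."""
    return bool(np.all(np.linalg.eigvalsh(X - Y) >= -tol))

A = np.array([[2.0, 1.0], [1.0, 1.0]])
B = np.array([[1.0, 1.0], [1.0, 1.0]])
C = np.array([[1.0, 1.0], [1.0, 2.0]])

print(loewner_geq(A, B), loewner_geq(C, B))   # both comparable with B
print(loewner_geq(A, C), loewner_geq(C, A))   # A and C are not comparable
print(np.linalg.eigvalsh(A - C))              # one negative and one positive eigenvalue
```

The difference $A - C = \operatorname{diag}(1, -1)$ is indefinite, which is why neither direction of the comparison holds.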


When $a > b$ for two positive real numbers, $a$ and $b$, $b a^{-1} < 1$. We derive a similar property for two
positive semi-definite matrices below.


Theorem 10.3 Let $A \succ 0$ and $B \succeq 0$, and denote by $\rho(\cdot)$ the spectral radius of a matrix (that is,
the maximum absolute value of the eigenvalues of a matrix). Then,
$$A \succ B \iff \rho(B A^{-1}) < 1, \tag{10.3}$$
$$A \succeq B \iff \rho(B A^{-1}) \le 1. \tag{10.4}$$


Proof: By Corollary 10.2, there exists an invertible matrix $M$ such that $M^{-\top} A M^{-1} = I_n$ and
$M^{-\top} B M^{-1} = D = \operatorname{diag}(d_i)$ is a diagonal matrix. Note that $d_i \ge 0$ since $B, D \in \mathbb{S}^n_{+}$. We apply
Theorem 10.1 to
$$A - B = M^\top I_n M - M^\top D M = M^\top (I_n - D) M \succ 0$$
and obtain that $I_n - D \succ 0$. Therefore, $d_i < 1$ for all $i$ and $\rho(D) < 1$.
Since $B = M^\top D M$ and $A^{-1} = M^{-1} M^{-\top}$, $B A^{-1} = M^\top D M M^{-1} M^{-\top} = M^\top D M^{-\top}$, that is, $D$ and
$B A^{-1}$ are similar to each other. By Fact 9.10, $D$ and $B A^{-1}$ share the same eigenvalues, and $\rho(B A^{-1}) < 1$.
(10.4) can be proved in parallel to the proof of (10.3). ∎


For a pair of square matrices, $A$ and $B$, $AB$ and $BA$ share the same set of eigenvalues, and thus,
$\rho(AB) = \rho(BA)$. Assuming $A \succ 0$ and $B \succ 0$,
$$A \succ B \iff \rho(B A^{-1}) = \rho(A^{-1} B) < 1 \iff B^{-1} \succ A^{-1}$$
due to Theorem 10.3. This is similar to the relationship between a positive real number and its inverse.
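The chain of equivalences above can be illustrated with a random pair $A \succ B \succ 0$. This is a sketch under assumptions: NumPy, illustrative helper names `spectral_radius` and `is_pd`, and a particular random construction that guarantees $A - B \succ 0$.

```python
import numpy as np

def spectral_radius(M):
    """Largest absolute value among the eigenvalues of M."""
    return float(np.max(np.abs(np.linalg.eigvals(M))))

def is_pd(M, tol=1e-10):
    """True if every eigenvalue of the symmetric matrix M exceeds tol."""
    return bool(np.all(np.linalg.eigvalsh(M) > tol))

rng = np.random.default_rng(1)
G = rng.standard_normal((3, 3))
B = G.T @ G + 0.1 * np.eye(3)            # B is positive definite
H = rng.standard_normal((3, 3))
A = B + H.T @ H + 0.1 * np.eye(3)        # A - B is positive definite

rho = spectral_radius(B @ np.linalg.inv(A))
inverse_gap = np.linalg.inv(B) - np.linalg.inv(A)

print(is_pd(A - B), rho < 1.0, is_pd(inverse_gap))
```

Swapping the roles of `A` and `B` makes the spectral radius exceed 1, since the eigenvalues of $A B^{-1}$ are the reciprocals of those of $B A^{-1}$.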

There are other inequalities induced by the partial order relations of positive semi-definite matrices.
Let us see an example. Consider two positive semi-definite matrices $A$ and $B$. By Fact 7.8, the
following holds for all $i$ if $A \succeq B$:
$$\lambda_i(A) = \lambda_i(B + (A - B)) \ge \lambda_i(B).$$
From this relationship, we derive the following useful inequalities:
$$\det A = \prod_{i=1}^{n} \lambda_i(A) \ge \prod_{i=1}^{n} \lambda_i(B) = \det B,$$
$$\operatorname{trace} A = \sum_{i=1}^{n} \lambda_i(A) \ge \sum_{i=1}^{n} \lambda_i(B) = \operatorname{trace} B.$$
In other words, matrix determinant and trace are monotonic functions on $\mathbb{S}^n_{+}$ with respect to $\succeq$.
As another example, we obtain the following result on the symmetric sum, which appears often in
fields such as control theory.


Theorem 10.4 (Symmetric Sum) Let $A \succ 0$ and $B \in \mathbb{S}^n$, and consider their symmetric sum
$$S = AB + BA.$$
Then, $S \succeq 0$ (resp., $S \succ 0$) implies that $B \succeq 0$ (resp., $B \succ 0$).


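Proof: Since $B$ is symmetric, we can write $B$ as $B = Q \Lambda Q^\top$ with an orthogonal matrix $Q =
[q_1 | q_2 | \cdots | q_n]$ and a diagonal matrix $\Lambda = \operatorname{diag}(\lambda_i(B))$, consisting of $B$'s eigenvalues, according to
the spectral decomposition theorem. According to Theorem 10.1, $Q^\top S Q \succeq 0$, as $S \succeq 0$. We can expand
$Q^\top S Q$ as
$$Q^\top S Q = Q^\top A B Q + Q^\top B A Q = Q^\top A Q\, Q^\top B Q + Q^\top B Q\, Q^\top A Q = Q^\top A Q \Lambda + \Lambda Q^\top A Q.$$
We see that the diagonal entries $(Q^\top S Q)_{ii}$ are non-negative, because $Q^\top S Q \succeq 0$, and that $(Q^\top S Q)_{ii} = 2\lambda_i(B)(Q^\top A Q)_{ii} =
2\lambda_i(B)\, q_i^\top A q_i \ge 0$. Furthermore, $\lambda_i(B) \ge 0$ holds, as $A \succ 0$ and $q_i^\top A q_i > 0$. Therefore, $B \succeq 0$.
We can prove the case of $S \succ 0$ similarly.
∎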


Example 10.4 [Matrix square-root preserves the PSD ordering] Let $A \succeq 0$ and $B \succeq 0$. We can apply
Theorem 10.4, because $A^{1/2} + B^{1/2} \succ 0$ and $A^{1/2} - B^{1/2} \in \mathbb{S}^n$, to
$$2(A - B) = \left(A^{1/2} + B^{1/2}\right)\left(A^{1/2} - B^{1/2}\right) + \left(A^{1/2} - B^{1/2}\right)\left(A^{1/2} + B^{1/2}\right).$$
That is, $A^{1/2} - B^{1/2} \succeq 0$, because $2(A - B) \succeq 0$ when $A \succeq B$. Therefore, it holds that $A^{1/2} \succeq B^{1/2}$.
This is similar to how square root maintains the order of positive real numbers.

The converse however does not hold. Let $A = \begin{bmatrix} 2 & 1 \\ 1 & 1 \end{bmatrix}$ and $B = \begin{bmatrix} 1.2 & 1 \\ 1 & 0.9 \end{bmatrix}$. Then, $A \succ 0$, $B \succ 0$ and
$A \succeq B$. However, $A^2 \not\succeq B^2$.
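Both halves of Example 10.4 can be verified numerically for the pair of matrices given there. A minimal sketch, assuming NumPy; `psd_sqrt` and `min_eig` are hypothetical helper names, and the square root is computed from the spectral decomposition.

```python
import numpy as np

def psd_sqrt(M):
    """Symmetric square root of a symmetric positive semi-definite matrix."""
    w, Q = np.linalg.eigh(M)
    return Q @ np.diag(np.sqrt(np.clip(w, 0.0, None))) @ Q.T

def min_eig(M):
    """Smallest eigenvalue of a symmetric matrix."""
    return float(np.linalg.eigvalsh(M).min())

A = np.array([[2.0, 1.0], [1.0, 1.0]])
B = np.array([[1.2, 1.0], [1.0, 0.9]])

print(min_eig(A - B))                        # positive: A dominates B
print(min_eig(psd_sqrt(A) - psd_sqrt(B)))    # positive: the ordering survives the square root
print(min_eig(A @ A - B @ B))                # negative: squaring breaks the ordering
```

The last quantity is negative because $A^2 - B^2$ has a negative determinant, so it has one negative eigenvalue.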


These results sometimes allow us to derive new expressions or conditions, when we use them with
block matrices. Let us consider the positive definiteness of block diagonal matrices. First, we see that the
positive definiteness of a block diagonal matrix is determined by the positive definiteness of its individual
diagonal blocks. Mathematically,
$$M \succ 0 \ (\text{resp., } M \succeq 0) \iff A \succ 0,\ B \succ 0 \ (\text{resp., } A \succeq 0,\ B \succeq 0),$$
when $M = \begin{bmatrix} A & 0_{n,m} \\ 0_{m,n} & B \end{bmatrix}$.³ We can refine this result in the case of a symmetric block matrix,
as follows.


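Fact 10.6 (Schur Complement) Let $A \in \mathbb{S}^n$, $B \in \mathbb{S}^m$ and $C \in \mathbb{R}^{n \times m}$ with $B \succ 0$. Consider a
symmetric block matrix
$$M = \begin{bmatrix} A & C \\ C^\top & B \end{bmatrix},$$
and consider the Schur complement $S = A - C B^{-1} C^\top$ of $A$ with respect to $M$. Then,
$$M \succ 0 \ (\text{resp., } M \succeq 0) \iff S \succ 0 \ (\text{resp., } S \succeq 0).$$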


Proof: We get a block diagonal matrix after performing Gaussian elimination on the block matrix $M$,
as follows:
$$L^\top M L = \begin{bmatrix} I_n & -C B^{-1} \\ 0 & I_m \end{bmatrix} \begin{bmatrix} A & C \\ C^\top & B \end{bmatrix} \begin{bmatrix} I_n & 0 \\ -B^{-1} C^\top & I_m \end{bmatrix} = \begin{bmatrix} S & 0 \\ C^\top & B \end{bmatrix} \begin{bmatrix} I_n & 0 \\ -B^{-1} C^\top & I_m \end{bmatrix} = \begin{bmatrix} S & 0_{n,m} \\ 0_{m,n} & B \end{bmatrix} = D,$$
where $L = \begin{bmatrix} I_n & 0 \\ -B^{-1} C^\top & I_m \end{bmatrix}$. Because $B \succ 0$, the positive definiteness of $D$ is equivalent to that of $S$.
The positive definiteness of $M$ is equivalent to that of $D$ due to the third property in Theorem 10.1, since
$L$ is an invertible lower-triangular matrix with all diagonal elements set to 1. Therefore, the positive
definiteness of $M$ is equivalent to that of $S$.
∎

³ We leave it for you to prove this.
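The block elimination in the proof of Fact 10.6 can be reproduced numerically. This sketch assumes NumPy; the helper names, the random blocks, and the two test choices of $A$ (one giving $S = I$, one giving $S = -0.5 I$) are illustrative assumptions.

```python
import numpy as np

def is_pd(M, tol=1e-10):
    """True if every eigenvalue of the symmetric matrix M exceeds tol."""
    return bool(np.all(np.linalg.eigvalsh(M) > tol))

def schur_complement(A, B, C):
    """S = A - C B^{-1} C^T for the block matrix [[A, C], [C^T, B]]."""
    return A - C @ np.linalg.solve(B, C.T)

def block_matrix(A, B, C):
    return np.block([[A, C], [C.T, B]])

def eliminate(A, B, C):
    """Return L^T M L with L = [[I, 0], [-B^{-1} C^T, I]]."""
    n, m = C.shape
    L = np.block([[np.eye(n), np.zeros((n, m))],
                  [-np.linalg.solve(B, C.T), np.eye(m)]])
    return L.T @ block_matrix(A, B, C) @ L

rng = np.random.default_rng(2)
n, m = 3, 2
G = rng.standard_normal((m, m))
B = G.T @ G + np.eye(m)                                  # B is positive definite
C = rng.standard_normal((n, m))
A_good = C @ np.linalg.solve(B, C.T) + np.eye(n)         # Schur complement equals I
A_bad = C @ np.linalg.solve(B, C.T) - 0.5 * np.eye(n)    # Schur complement equals -0.5 I

for A in (A_good, A_bad):
    S = schur_complement(A, B, C)
    print(is_pd(S), is_pd(block_matrix(A, B, C)))
```

In both cases the definiteness of the full block matrix agrees with that of the Schur complement, and `eliminate` returns the block diagonal matrix with blocks $S$ and $B$.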


10.4 Schur Triangularization 


We present and prove Schur triangularization which is useful not only for proving further results later
in this book but also in many real-world problems where the repeated product of a matrix is needed.
Before doing so, we introduce a unitary matrix which is a complex version of the orthogonal matrix. A
matrix $Q$ is unitary when $Q^{-1} = Q^H$, that is, $Q Q^H = Q^H Q = I$. When $A = Q B Q^H$ for a matrix $A$,
$A^n = Q B^n Q^H$, which is a useful property when computing the power of a matrix.


Theorem 10.5 (Schur Triangularization) Let the eigenvalues of an $n \times n$ matrix $A$ be arranged
in any given order $\lambda_1, \lambda_2, \ldots, \lambda_n$ (including multiplicities), and let $(\lambda_1, x)$ be an eigenpair of $A$, in
which $x$ is a unit vector. Then,

(a) There is an $n \times n$ unitary matrix $Q = [x \,|\, Q_2]$ such that $A = Q U Q^H$, where $U = (u_{ij})$ is
upper triangular and has diagonal entries $u_{ii} = \lambda_i$ for $i = 1, 2, \ldots, n$.

(b) If $A$, each eigenvalue and $x$ are all real, then there is an $n \times n$ real orthogonal $Q = [x \,|\, Q_2]$
such that $A = Q U Q^\top$, where $U = (u_{ij})$ is upper triangular and has diagonal entries $u_{ii} = \lambda_i$
for $i = 1, 2, \ldots, n$.


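Proof: We use mathematical induction to prove this theorem. First, (a) trivially holds when $n = 1$.
Now assume (a) holds up to $n - 1$. We construct an $n \times n$ unitary matrix $\hat{Q} = [x \,|\, V]$ from $n$ orthonormal vectors
including $x$ (which can be obtained using for instance the Gram-Schmidt procedure). $V$ here denotes an
$n \times (n - 1)$ matrix such that $V^H x = 0$. Then,
$$\hat{Q}^H A \hat{Q} = \begin{bmatrix} x^H \\ V^H \end{bmatrix} [\lambda_1 x \,|\, A V] = \begin{bmatrix} \lambda_1 x^H x & x^H A V \\ \lambda_1 V^H x & V^H A V \end{bmatrix} = \begin{bmatrix} \lambda_1 & x^H A V \\ 0 & V^H A V \end{bmatrix},$$
because
$$A \hat{Q} = [A x \,|\, A V] = [\lambda_1 x \,|\, A V].$$
The eigenvalues of the block upper-triangular matrix on the right-hand side consist of the eigenvalues
of individual blocks, that is, $\lambda_1$ and the eigenvalues of $V^H A V$. Because $A$ and $\hat{Q}^H A \hat{Q}$ are similar and
thus have the same set of eigenvalues, the eigenvalues of $V^H A V$ are $\lambda_2, \ldots, \lambda_n$. Due to the assumption,
it holds that $V^H A V = Q_{n-1} U_{n-1} Q_{n-1}^H$, where $Q_{n-1}$ is an $(n - 1) \times (n - 1)$ unitary matrix, and $U_{n-1}$
is an upper-triangular matrix whose diagonal entries are $\lambda_2, \ldots, \lambda_n$. Using an appropriate choice of an
$(n - 1)$-dimensional $a$ (namely $a^H = x^H A V Q_{n-1}$), we can rewrite $\hat{Q}^H A \hat{Q}$ as
$$\hat{Q}^H A \hat{Q} = \begin{bmatrix} \lambda_1 & x^H A V \\ 0 & Q_{n-1} U_{n-1} Q_{n-1}^H \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ 0 & Q_{n-1} \end{bmatrix} \begin{bmatrix} \lambda_1 & a^H \\ 0 & U_{n-1} \end{bmatrix} \begin{bmatrix} 1 & 0 \\ 0 & Q_{n-1}^H \end{bmatrix}.$$
Since $Q_{n-1}$ is unitary, both $\begin{bmatrix} 1 & 0 \\ 0 & Q_{n-1} \end{bmatrix}$ and $Q = \hat{Q} \begin{bmatrix} 1 & 0 \\ 0 & Q_{n-1} \end{bmatrix}$ are unitary as well. $U = \begin{bmatrix} \lambda_1 & a^H \\ 0 & U_{n-1} \end{bmatrix}$ is
an upper-triangular matrix with $\lambda_1, \lambda_2, \ldots, \lambda_n$ on its diagonal. With $Q$ and $U$, we see that $A = Q U Q^H$,
which proves (a). We can prove (b) similarly, with the conjugate transpose in (a) replaced by the transpose.
∎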
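The inductive proof of Theorem 10.5 translates directly into a recursive procedure. The sketch below follows the proof step by step under a few assumptions: it uses NumPy, obtains one eigenpair from `np.linalg.eig`, and completes $x$ to an orthonormal basis with a QR factorization instead of Gram-Schmidt. The function name `schur_by_deflation` is hypothetical, and the routine is meant for illustration on small matrices rather than as a robust solver.

```python
import numpy as np

def schur_by_deflation(A):
    """Return (Q, U) with Q unitary, U upper triangular and A = Q U Q^H."""
    A = np.asarray(A, dtype=complex)
    n = A.shape[0]
    if n == 1:
        return np.eye(1, dtype=complex), A.copy()
    # Pick one eigenpair (lambda_1, x) and normalize x.
    _, vecs = np.linalg.eig(A)
    x = vecs[:, 0] / np.linalg.norm(vecs[:, 0])
    # Complete x to an orthonormal basis: QR of [x | I] gives an n x n unitary
    # matrix whose first column is x up to a unit-modulus factor.
    Q_hat, _ = np.linalg.qr(np.column_stack([x, np.eye(n, dtype=complex)]))
    V = Q_hat[:, 1:]
    # Induction step: triangularize the (n-1) x (n-1) block V^H A V.
    Q_sub, _ = schur_by_deflation(V.conj().T @ A @ V)
    Q = np.column_stack([Q_hat[:, 0], V @ Q_sub])
    U = Q.conj().T @ A @ Q
    return Q, U

rng = np.random.default_rng(4)
A = rng.standard_normal((4, 4))
Q, U = schur_by_deflation(A)

print(np.round(np.abs(np.tril(U, -1)).max(), 12))   # size of the part below the diagonal
print(np.round(np.diag(U), 4))                      # eigenvalues of A on the diagonal of U
```

The diagonal of `U` carries the eigenvalues of `A`, so its sum and product reproduce the trace and determinant of `A`.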

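With Schur triangularization, we can easily prove earlier results on computing the trace and determinant
of a matrix using its eigenvalues.

Corollary 10.4 Let $\lambda_1, \lambda_2, \ldots, \lambda_n$ be the eigenvalues of an $n \times n$ matrix $A$. Then,
$$\operatorname{trace} A = \lambda_1 + \lambda_2 + \cdots + \lambda_n \quad \text{and} \quad \det A = \lambda_1 \lambda_2 \cdots \lambda_n.$$

Proof: According to Theorem 10.5, we can decompose $A$ as $A = Q U Q^H$. Then,
$$\operatorname{trace} A = \operatorname{trace}(Q U Q^H) = \operatorname{trace}(U Q^H Q) = \operatorname{trace} U = \lambda_1 + \lambda_2 + \cdots + \lambda_n,$$
and
$$\det A = \det(Q U Q^H) = \det Q \det U \det Q^H = \det(Q Q^H) \det U = \det U = \lambda_1 \lambda_2 \cdots \lambda_n.$$
∎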


10.5 Perron-Frobenius Theorem 


Let an $n \times n$ matrix $A = (a_{ij})$ satisfy $a_{ij} \ge 0$ and $\sum_{j=1}^{n} a_{ij} = 1$, that is, the sum of each row is 1, and
all the elements are greater than or equal to 0. We call such a matrix a Markov matrix and use it
to describe a probabilistic transition in a dynamical system known as a Markov chain. In this case, $a_{ij}$
denotes the probability of the system transitioning from the $i$-th state to the $j$-th state. Because every
row sums to 1, the Markov matrix always has $(1, \mathbf{1})$ as its eigenpair. All the other eigenvalues are less
than or equal to 1 in absolute value.
Fact 10.7 $\rho(A) = 1$ if $A$ is a Markov matrix.

Proof: Let $(\lambda, v)$ be an eigenpair of $A$. $|v_k| > 0$ if $k = \operatorname{argmax}_{1 \le i \le n} |v_i|$. Let us consider the $k$-th
equation in $A v = \lambda v$, which is $\sum_{j=1}^{n} a_{kj} v_j = \lambda v_k$. Since $A$ is a Markov matrix, $a_{kj} \ge 0$ and $\sum_{j=1}^{n} a_{kj} = 1$.
Then, $|\lambda| \le 1$, because
$$|\lambda| |v_k| = |\lambda v_k| = \Big| \sum_{j=1}^{n} a_{kj} v_j \Big| \le \sum_{j=1}^{n} a_{kj} |v_j| \le \sum_{j=1}^{n} a_{kj} |v_k| = |v_k|.$$
Since $(1, \mathbf{1})$ is an eigenpair, $\rho(A) = 1$.
∎
We call a vector $p$ a probability vector if $p \ge 0$ and $p^\top \mathbf{1} = 1$. When $p$ represents the distribution over
the states of a system at a time, $p^\top A$ is the distribution after one step of transition has happened. Then,
$p^\top A = p^\top$ implies that the distribution over the system's states does not evolve even after transition
according to $A$. We call such a probability vector $p$ a stationary distribution or an equilibrium distribution.
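Fact 10.7 and the notion of a stationary distribution can be illustrated on a small chain. The 3-state matrix below is a made-up example (it deliberately contains a zero entry), and the sketch assumes NumPy; the stationary distribution is read off as the left eigenvector for the eigenvalue 1.

```python
import numpy as np

# A hypothetical 3-state Markov matrix: non-negative entries, every row sums to 1.
A = np.array([[0.5, 0.5, 0.0],
              [0.2, 0.3, 0.5],
              [0.1, 0.4, 0.5]])

eigvals = np.linalg.eigvals(A)
rho = float(np.max(np.abs(eigvals)))       # spectral radius

# Left eigenvectors of A are right eigenvectors of A^T.
w, V = np.linalg.eig(A.T)
k = int(np.argmin(np.abs(w - 1.0)))        # index of the eigenvalue closest to 1
p = np.real(V[:, k])
p = p / p.sum()                            # normalize so that the entries sum to 1

print(rho)
print(p, p @ A)
```

After one transition step, `p @ A` reproduces `p`, which is exactly the stationarity condition $p^\top A = p^\top$.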


We present the Perron-Frobenius theorem for a matrix with positive entries only. 


Theorem 10.6 (Perron-Frobenius) Let $A = (a_{ij})$ be an $n \times n$ positive matrix, i.e., $a_{ij} > 0$ for
all $i$ and $j$. Then, there exists a positive eigenvalue equal to the spectral radius of $A$, and its associated
eigenvector with at least one positive element is unique up to scaling by a positive constant and
has only positive components.


Proof:
Consider $\lambda_0 = \max\{\lambda : \text{there exists } x \ne 0,\ x \ge 0 \text{ such that } A x \ge \lambda x\}$.⁴ Because $A > 0$, $A\mathbf{1} > 0$. From
this, we get that $A\mathbf{1} \ge \lambda' \mathbf{1}$ and hence $\lambda_0 \ge \lambda' > 0$, where $\lambda'$ is the smallest element in $A\mathbf{1}$. With $\lambda_0$, let
$x_0$ satisfy $A x_0 \ge \lambda_0 x_0$. If $A x_0 \ne \lambda_0 x_0$, then $y = A x_0 - \lambda_0 x_0 \ge 0$ and $y \ne 0$. Using $A > 0$, we know that $A y > 0$
and that $A(A x_0) > \lambda_0 (A x_0)$, because $A y = A A x_0 - \lambda_0 A x_0$. This means that we can find $\lambda$ greater than
$\lambda_0$ for $A x_0$ that satisfies $A(A x_0) \ge \lambda (A x_0)$, which is contradictory to the definition of $\lambda_0$. Therefore,
$A x_0 = \lambda_0 x_0$ and $x_0 > 0$.

Let $(\lambda, y)$ be an eigenpair with $\lambda \ne \lambda_0$. Let $\hat{y}$ be a vector with $|y_i|$ as its elements. If we take the
absolute values of both sides of $\sum_{j=1}^{n} a_{ij} y_j = \lambda y_i$, which is the $i$-th row of $A y = \lambda y$, we get
$$|\lambda| \hat{y}_i = |\lambda| |y_i| = |\lambda y_i| = \Big| \sum_{j=1}^{n} a_{ij} y_j \Big| \le \sum_{j=1}^{n} a_{ij} |y_j| = \sum_{j=1}^{n} a_{ij} \hat{y}_j.$$
In vectors, this is equivalent to $A \hat{y} \ge |\lambda| \hat{y}$. Together with the definition of $\lambda_0$, $|\lambda| \le \lambda_0$.

Finally, consider $x_1 \ge 0$ that is linearly independent of $x_0$ and satisfies $A x_1 = \lambda_0 x_1$. Then, $w = a x_0 + x_1$
is an eigenvector of $\lambda_0$ for any $a$, but there exists a negative value for $a$ such that $w \ge 0$ but $w \not> 0$. This
is contradictory, as we already showed above that the eigenvector of $\lambda_0$ is positive. Therefore, there is a
unique eigenvector associated with $\lambda_0$.
∎


When a Markov matrix $A$ is positive, there is a positive left-eigenvector associated with the eigenvalue
of 1, according to Theorem 10.6. Once we normalize this vector (to sum to 1), we get the stationary
distribution. If $A$ has a 0 entry, we cannot apply Theorem 10.6, but we can use Theorem 10.8 to show
that it also has a stationary distribution.


⁴ Think about why we can use max instead of sup.




10.6 Eigenvalue Adjustments and Applications 


According to the Brauer theorem below, when we add a rank-one matrix $v w^H$ to a matrix, which has an
eigenpair $(\lambda, v)$, only $\lambda$ changes to $\lambda + w^H v$, while all the other eigenvalues are maintained. This is a
useful result when we want to adjust only one particular eigenvalue, and it has been used to improve
the convergence rate of Google's PageRank.


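Theorem 10.7 (Brauer) Let $(\lambda, v)$ be an eigenpair of an $n \times n$ matrix $A$ and $\lambda, \lambda_2, \ldots, \lambda_n$ its
eigenvalues. For any $w \in \mathbb{C}^n$, the eigenvalues of $A + v w^H$ are $\lambda + w^H v, \lambda_2, \ldots, \lambda_n$, and $(\lambda + w^H v, v)$
is an eigenpair of $A + v w^H$.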


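Proof: Let $u = v / \|v\|$ be the unit vector constructed from $v$. Then, $(\lambda, u)$ is also an eigenpair of $A$. By Schur
triangularization from Theorem 10.5, we know the following holds with a unitary matrix $Q = [u \,|\, Q_2]$:
$$Q^H A Q = U = \begin{bmatrix} \lambda & a^H \\ 0 & U_{n-1} \end{bmatrix},$$
where $U_{n-1}$ is an upper-triangular matrix with $\lambda_2, \ldots, \lambda_n$ on its diagonal.
Because
$$Q^H v w^H Q = (Q^H v)(w^H Q) = \begin{bmatrix} u^H v \\ Q_2^H v \end{bmatrix} [w^H u \,|\, w^H Q_2] = \begin{bmatrix} \|v\| \\ 0 \end{bmatrix} [w^H u \,|\, w^H Q_2] = \begin{bmatrix} w^H v & b^H \\ 0 & 0 \end{bmatrix}$$
with $b^H = \|v\| w^H Q_2$, we arrive at
$$Q^H (A + v w^H) Q = \begin{bmatrix} \lambda & a^H \\ 0 & U_{n-1} \end{bmatrix} + \begin{bmatrix} w^H v & b^H \\ 0 & 0 \end{bmatrix} = \begin{bmatrix} \lambda + w^H v & (a + b)^H \\ 0 & U_{n-1} \end{bmatrix}.$$
The eigenvalues of the final matrix are $\lambda + w^H v, \lambda_2, \ldots, \lambda_n$, and so are the eigenvalues of its similar matrix
$A + v w^H$. Furthermore, $(A + v w^H) v = A v + v w^H v = (\lambda + w^H v) v$, which completes the proof.
∎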
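Theorem 10.7 is easy to test numerically. The sketch below assumes NumPy and, purely for convenience, a real symmetric test matrix so that the eigenvalues are real and can be sorted; the theorem itself does not need symmetry, and the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 4
G = rng.standard_normal((n, n))
A = G + G.T                          # symmetric test matrix with real eigenvalues

lam, V = np.linalg.eigh(A)           # eigenvalues in ascending order
v = V[:, -1]                         # eigenvector for the largest eigenvalue lam[-1]
w = rng.standard_normal(n)

A_new = A + np.outer(v, w)           # rank-one update v w^T (real case, so w^H = w^T)
new_eigs = np.linalg.eigvals(A_new)

# Brauer: only lam[-1] moves, to lam[-1] + w^T v; the other eigenvalues stay put.
expected = np.append(lam[:-1], lam[-1] + w @ v)

print(np.sort(new_eigs.real))
print(np.sort(expected))
```

The vector `v` remains an eigenvector of the updated matrix, now for the shifted eigenvalue `lam[-1] + w @ v`.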


Using Theorem 10.7, we can also change the rest of the eigenvalues of a matrix.


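Corollary 10.5 Let $(\lambda, v)$ be an eigenpair of an $n \times n$ matrix $A$ and let $\lambda, \lambda_2, \ldots, \lambda_n$ be its eigenvalues.
Let $w \in \mathbb{C}^n$ be such that $w^H v = 1$ and let $\tau \in \mathbb{C}$. Then the eigenvalues of $A_\tau = \tau A + (1 - \tau) \lambda v w^H$ are
$\lambda, \tau \lambda_2, \ldots, \tau \lambda_n$.

Proof: The eigenvalues of $\tau A$ are $\tau \lambda, \tau \lambda_2, \ldots, \tau \lambda_n$, and $(\tau \lambda, v)$ is an eigenpair of $\tau A$. Writing $A_\tau = \tau A + v \tilde{w}^H$
with $\tilde{w} = \overline{(1 - \tau) \lambda}\, w$, Theorem 10.7 shows that the eigenvalues of $A_\tau$ are
$\tau \lambda + \tilde{w}^H v = \tau \lambda + (1 - \tau) \lambda w^H v = \tau \lambda + (1 - \tau) \lambda = \lambda$ and $\tau \lambda_2, \ldots, \tau \lambda_n$.
∎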

Now we present the Perron-Frobenius theorem for a non-negative matrix, expanding Theorem 10.6
from earlier.




Theorem 10.8 (Perron-Frobenius) Any Markov matrix has a stationary distribution.

Proof: Let $A$ be a Markov matrix. Fact 10.7 states that the eigenvalues of $A$ are $1, \lambda_2, \ldots, \lambda_n$
with $|\lambda_i| \le 1$ for all $i$. If $A > 0$, Theorem 10.6 furthermore shows that there is a unique
stationary distribution. If $A$ has zeroes, we cannot use Theorem 10.6. Instead, we convert $A$ into a
positive matrix using Theorem 10.7. If we apply Theorem 10.7 with $v = \mathbf{1}$ and $w = \epsilon \mathbf{1}$, where $\epsilon$ is a
small positive number, the eigenvalues of the positive matrix $A + \epsilon \mathbf{1}\mathbf{1}^\top$ are $1 + n\epsilon, \lambda_2, \ldots, \lambda_n$. According
to Theorem 10.6, there is a unique unit left-eigenvector $u_\epsilon$ associated with the eigenvalue $1 + n\epsilon$ for the
positive matrix $A + \epsilon \mathbf{1}\mathbf{1}^\top$ for each $\epsilon$, and this eigenvector is positive. Since $\{x \in \mathbb{R}^n : \|x\| = 1\}$ is compact,
for any sequence $\epsilon_k \to 0$, a subsequence $u_{\epsilon_k}$ of the sequence $u_\epsilon$ on this sphere converges to a vector $u$ on this sphere as $k \to \infty$. Because
$$u_{\epsilon_k}^\top (A + \epsilon_k \mathbf{1}\mathbf{1}^\top) = (1 + n \epsilon_k) u_{\epsilon_k}^\top,$$
we get
$$u^\top A = u^\top$$
in the limit of $k \to \infty$. Then, $u = \lim_{k \to \infty} u_{\epsilon_k} \ge 0$, and normalizing $u$ to sum to 1 gives a stationary distribution, which completes the proof.
∎


Example 10.5 [Google Matrix] Assume a large Markov matrix $A$ that represents the connectivity
among the web pages on the internet. A core idea behind Google Search is to rank these web pages
according to the left eigenvector of this matrix $A$. When a Markov matrix is large, it is usual in practice
to use an iterative algorithm, such as power iteration. Such an iterative algorithm converges fast when
the difference between the first two largest eigenvalues is large. For computational efficiency, we
modify $A$ such that the only eigenvalue whose absolute value is 1 is 1 and the absolute values of all the
other eigenvalues are less than 1, while ensuring the modified matrix continues to be a Markov matrix.
Let $v = \mathbf{1}$, $w = \frac{1}{n}\mathbf{1}$, $E = \mathbf{1}\mathbf{1}^\top$ and $0 < \tau < 1$. Note that $w^\top v = 1$. Compute the weighted sum of $A$ and the
rank-one matrix $v w^\top$ with $\tau$ and $1 - \tau$ as their coefficients, respectively, and we get
$$A_\tau = \tau A + (1 - \tau) v w^\top = \tau A + \frac{1 - \tau}{n} E. \tag{10.5}$$
The eigenvalues of this matrix $A_\tau$ are $1, \tau \lambda_2, \ldots, \tau \lambda_n$ according to Corollary 10.5. In other words, all
eigenvalues except for 1 have been shrunk by the factor of $\tau$. This weighted sum maintains the row sum
to be 1, because $A_\tau \mathbf{1} = \tau A \mathbf{1} + (1 - \tau) \mathbf{1} \frac{1}{n} \mathbf{1}^\top \mathbf{1} = \tau \mathbf{1} + (1 - \tau) \mathbf{1} = \mathbf{1}$, and all the entries of this matrix
are positive. We refer to $A_\tau$ as a Google matrix, as it was used to rank web pages for Google
Search in early years. The founders of Google used $\tau = 0.85$ then.

We refer readers to Section 9.7 for details on power iteration.
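The construction in (10.5) and its effect on power iteration can be sketched as follows. The 4-page link matrix is a made-up toy example, the function names `google_matrix` and `stationary_by_power_iteration` are hypothetical, and the fixed iteration count is an assumption chosen so that $0.85^{200}$ is negligible; NumPy is assumed.

```python
import numpy as np

def google_matrix(A, tau=0.85):
    """A_tau = tau * A + (1 - tau) / n * E, with E the all-ones matrix."""
    n = A.shape[0]
    return tau * A + (1.0 - tau) / n * np.ones((n, n))

def stationary_by_power_iteration(M, num_iters=200):
    """Iterate p^T <- p^T M starting from the uniform distribution."""
    n = M.shape[0]
    p = np.full(n, 1.0 / n)
    for _ in range(num_iters):
        p = p @ M
    return p

# Hypothetical link structure: row i spreads its weight evenly over the pages it links to.
A = np.array([[0.0, 0.5, 0.5, 0.0],
              [0.0, 0.0, 1.0, 0.0],
              [1.0, 0.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 0.0]])

A_tau = google_matrix(A, tau=0.85)
p = stationary_by_power_iteration(A_tau)

moduli = np.sort(np.abs(np.linalg.eigvals(A_tau)))[::-1]
print(moduli)          # largest modulus is 1, the rest are at most tau
print(p)               # ranking scores of the four pages
```

Since every eigenvalue other than 1 has modulus at most $\tau$, the error of power iteration shrinks at least geometrically with ratio $\tau$ per step.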


Chapter 11 


Big Theorems in Linear Algebra 


We can define a monomial of a square matrix $A$ by replacing $x$ in $p(x) = c x^n$ with $A$, such that $p(A) =
c A^n$. More generally, we can think of replacing $x$ with a square matrix $A$ in a polynomial $p(x) =
c_n x^n + c_{n-1} x^{n-1} + \cdots + c_1 x + c_0$. By defining $A^0$ to be an identity matrix and treating $c_0$ as $c_0 x^0$, we
get $p(A) = c_n A^n + c_{n-1} A^{n-1} + \cdots + c_1 A + c_0 I$. Let us use $p_A(x)$ to denote the characteristic polynomial
of a matrix $A$, which may have complex coefficients. The two most abstract and general results in linear
algebra are the Cayley-Hamilton theorem and the Jordan normal form theorem, both of which work with
a characteristic polynomial of a matrix. According to the Cayley-Hamilton theorem, $p_A(A) = 0$ for any
$A$, and using the Jordan normal form theorem, we can show that any matrix is similar to either a diagonal
matrix or a Jordan form matrix.


11.1 The First Big Theorem: Cayley-Hamilton Theorem 


In this section, we present the Cayley-Hamilton theorem which states that $p_A(A) = 0$ for any square
matrix $A$ when $p_A(x)$ is its characteristic polynomial. We use Schur triangularization from Theorem 10.5
to prove it.


Theorem 11.1 (Cayley-Hamilton) Let
$$p_A(x) = \det(A - x I) = (-1)^n x^n + c_{n-1} x^{n-1} + \cdots + c_1 x + c_0$$
be the characteristic polynomial of an $n \times n$ matrix $A$. Then
$$p_A(A) = (-1)^n A^n + c_{n-1} A^{n-1} + \cdots + c_1 A + c_0 I_n = 0.$$


Proof: Let the eigenvalues of $A$ be $\lambda_1, \lambda_2, \ldots, \lambda_n$. We can then write its characteristic polynomial as
$$p_A(x) = (\lambda_1 - x)(\lambda_2 - x) \cdots (\lambda_n - x).$$
According to the Schur triangularization (Theorem 10.5), we can rewrite $A$ as $A = Q U Q^H$ with a unitary
matrix $Q$ and an upper-triangular matrix $U$ with $\lambda_1, \lambda_2, \ldots, \lambda_n$ on its diagonal. Since $p_A(A) =
Q\, p_A(U)\, Q^H$, all we need is to show that $p_A(U) = (-1)^n (U - \lambda_1 I)(U - \lambda_2 I) \cdots (U - \lambda_n I) = 0$.
For $j = 1, 2, \ldots, n$, set
$$U_j = (U - \lambda_1 I)(U - \lambda_2 I) \cdots (U - \lambda_j I).$$
$U - \lambda_i I$ is an upper-triangular matrix whose $i$-th diagonal entry is zero, that is, $(U - \lambda_i I)_{ii} = 0$. For instance,
the first column of $U_1 = U - \lambda_1 I$ is an all-zero vector. Similarly, the first two columns of $U_2 = (U -
\lambda_1 I)(U - \lambda_2 I)$ are all zeros, which can be checked without difficulty.

Based on this observation, assume the first $j - 1$ columns of $U_{j-1}$ are all zeros. Given $U_j = U_{j-1}(U -
\lambda_j I)$, each element in the $j$-th column of $U_j$ is determined as the inner product between the corresponding
row of $U_{j-1}$ and the $j$-th column of $U - \lambda_j I$. Since the $j$-th element in the $j$-th column of $U - \lambda_j I$ is
zero, only the first $j - 1$ elements of the $j$-th column can be non-zero. By the assumption, the first $j - 1$
elements of every row of $U_{j-1}$ are all zeros, and hence the inner products of the rows of $U_{j-1}$ and the
$j$-th column of $U - \lambda_j I$ result in all zeros. In other words, the $j$-th column of $U_j$ is an all-zero vector.
The same argument applies to the earlier columns of $U - \lambda_j I$, whose non-zero elements also lie among the first $j - 1$ positions.
Therefore, the first $j$ columns of $U_j$ are all zeros, which implies that $p_A(U) = (-1)^n U_n = 0$.
∎

Although the Cayley-Hamilton theorem is frequently used in theoretical derivations and thought
experiments, it is not practically used due to the high computational complexity involved in identifying
the coefficients of the characteristic polynomial. Let us consider one interesting consequence of this
theorem on the polynomial expression of an inverse matrix. When the characteristic polynomial of an
invertible matrix $A$ is $(-1)^n x^n + c_{n-1} x^{n-1} + \cdots + c_1 x + c_0$, the Cayley-Hamilton theorem states that
$(-1)^n A^n + c_{n-1} A^{n-1} + \cdots + c_1 A + c_0 I_n = 0$. After multiplying both sides with $A^{-1}$ and re-arranging
terms, we arrive at
$$A^{-1} = -\frac{c_1}{c_0} I_n - \frac{c_2}{c_0} A - \cdots - \frac{c_{n-1}}{c_0} A^{n-2} - \frac{(-1)^n}{c_0} A^{n-1},$$
where $c_0 = \det A \ne 0$. This allows us to express the inverse of $A$ as a linear combination of the powers of $A$, although it is a
hurdle in practice to figure out the $c_i$'s.
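For a small matrix, both the Cayley-Hamilton identity and the resulting formula for the inverse can be checked directly. The sketch below assumes NumPy and uses `np.poly`, which returns the coefficients of the monic polynomial $\det(xI - A) = (-1)^n p_A(x)$; the sign convention therefore differs from the text by the factor $(-1)^n$, which does not affect either identity. The function names and the test matrix are illustrative.

```python
import numpy as np

def char_poly_at_matrix(A):
    """Evaluate det(xI - A) at x = A using Horner's rule."""
    n = A.shape[0]
    coeffs = np.poly(A)                  # [1, k_1, ..., k_n] for x^n + k_1 x^(n-1) + ... + k_n
    P = np.zeros((n, n))
    for c in coeffs:
        P = P @ A + c * np.eye(n)
    return P

def inverse_via_cayley_hamilton(A):
    """A^{-1} = -(A^(n-1) + k_1 A^(n-2) + ... + k_(n-1) I) / k_n."""
    n = A.shape[0]
    coeffs = np.poly(A)
    Q = np.zeros((n, n))
    for c in coeffs[:-1]:                # Horner's rule without the constant term
        Q = Q @ A + c * np.eye(n)
    return -Q / coeffs[-1]

A = np.array([[2.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 4.0]])

print(np.round(char_poly_at_matrix(A), 8))            # numerically the zero matrix
print(np.round(inverse_via_cayley_hamilton(A) @ A, 8))  # numerically the identity
```

As the text notes, this route is mainly of theoretical interest: computing the coefficients already requires the eigenvalues (or determinants), and the powers of $A$ amplify rounding errors for larger matrices.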


11.2 Decomposition of Nilpotency into Cyclic Subspaces 


Nilpotency of a Matrix 


A square matrix $A$ is nilpotent of degree $r$ if $A^{r-1} \ne 0$ but $A^r = 0$ for a natural number $r$. More generally,
given a subspace $W$, we say "$A$ is nilpotent of degree $r$ on $W$", if $A^r v = 0$ for all $v \in W$ but $A^{r-1} w \ne 0$
for some $w \in W$. Let us study an important example of such $W$ given a nilpotent matrix $A$.

Let us construct the following set $V_A$ given an $n$-dimensional vector space $V$ and an $n \times n$ matrix $A$:
$$V_A = \{v \in V : A^j v = 0 \text{ for some } j\}. \tag{11.1}$$




Note that $A^j v = A^{j - j_1} A^{j_1} v = 0$ for all $j \ge j_1$ if $A^{j_1} v = 0$ for some $j_1$. This implies that if $A^{j_1} v_1 = 0$ and
$A^{j_2} v_2 = 0$, then, $A^{\max\{j_1, j_2\}} (v_1 + v_2) = A^{\max\{j_1, j_2\}} v_1 + A^{\max\{j_1, j_2\}} v_2 = 0$. That is, $v_1 + v_2 \in V_A$ for any
$v_1, v_2 \in V_A$, which says that $V_A$ is a subspace of $V$.

Let $\{v_1, \ldots, v_k\}$ be a basis of $V_A$. Each $v_i$ is in $V_A$ and there exists $r_i$ with which $A^{r_i} v_i = 0$ for all
$i$. Set $r_0 = \max\{r_1, \ldots, r_k\}$. For any $r \ge r_0$, $A^r v_i = 0$ for all $i$. Because we can express any vector $v$ in
$V_A$ as $v = x_1 v_1 + \cdots + x_k v_k$, $A^r v = x_1 A^r v_1 + \cdots + x_k A^r v_k = 0$, and hence $V_A \subseteq \operatorname{Null}(A^r)$. From the
definition of $V_A$ in (11.1), it is clear that $V_A \supseteq \operatorname{Null}(A^r)$, and therefore, we arrive at
$$V_A = \operatorname{Null}(A^r) \text{ for some } r.$$

Assume $v \in \operatorname{Null}(A^r) \cap \operatorname{Col}(A^r)$. Because $v \in \operatorname{Col}(A^r)$, there exists $w \in V$ such that $v = A^r w$.
Also, because $v \in \operatorname{Null}(A^r)$, $A^r v = A^{2r} w = 0$. According to (11.1), $w \in V_A$ and $v = A^r w = 0$, which
tells us that $\operatorname{Null}(A^r) \cap \operatorname{Col}(A^r) = \{0\}$. In other words, two subspaces, $\operatorname{Null}(A^r)$ and $\operatorname{Col}(A^r)$, are
mutually independent. That is, $x_1 = x_2 = 0$ when $x_1 v_1 + x_2 v_2 = 0$ for non-zero $v_1 \in \operatorname{Null}(A^r)$ and $v_2 \in \operatorname{Col}(A^r)$,
because $x_1 v_1 = -x_2 v_2 \in \operatorname{Null}(A^r) \cap \operatorname{Col}(A^r) = \{0\}$. Since the bases of these two subspaces are linearly
independent, $\dim(\operatorname{Null}(A^r) + \operatorname{Col}(A^r)) = \dim \operatorname{Null}(A^r) + \dim \operatorname{Col}(A^r)$, and $\dim(\operatorname{Null}(A^r) + \operatorname{Col}(A^r)) =
n = \dim V$ according to the rank-nullity theorem (see Theorem 3.4). There is hence a natural number $r$
given a square matrix $A$ such that
$$V = \operatorname{Null}(A^r) \oplus \operatorname{Col}(A^r), \tag{11.2}$$
where $\oplus$ is a direct sum introduced in Definition 3.3.

These two subspaces, $\operatorname{Null}(A^r)$ and $\operatorname{Col}(A^r)$, are invariant under $A$. For $v \in \operatorname{Null}(A^r)$, $A v \in \operatorname{Null}(A^r)$
since $A^r (A v) = A^{r+1} v = A (A^r v) = 0$. For $v \in \operatorname{Col}(A^r)$, $A v \in \operatorname{Col}(A^r)$ since $v = A^r w$ for some $w \in V$
and thus $A v = A^r (A w)$. Furthermore, it is easy to see that these two subspaces are invariant under the
addition of $aI$ for a scalar $a$. That is, $\operatorname{Null}(A^r)$ and $\operatorname{Col}(A^r)$ are invariant under $A + aI$. Combining
these observations, we arrive at the following theorem.
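Theorem 11.2 Let $A$ be an $n \times n$ matrix and $V$ be an $n$-dimensional vector space. Let
$$V_A = \{v \in V : A^j v = 0 \text{ for some } j\}.$$
Then there exists $r_0$ such that for any $r \ge r_0$
$$V_A = \operatorname{Null}(A^r)$$
and
$$V = \operatorname{Null}(A^r) \oplus \operatorname{Col}(A^r),$$
where $\operatorname{Null}(A^r)$ and $\operatorname{Col}(A^r)$ are invariant under $A + aI$ for any scalar $a$.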
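Theorem 11.2 can be observed on a small example. The sketch below, assuming NumPy, builds a hypothetical $5 \times 5$ matrix from a $3 \times 3$ nilpotent block and an invertible $2 \times 2$ block hidden behind a change of basis $P$; subspace bases are taken from the SVD with an illustrative tolerance, and `rank_and_bases` is a made-up helper name.

```python
import numpy as np

def rank_and_bases(M, tol=1e-9):
    """Return (rank, null-space basis, column-space basis) of a square M via the SVD."""
    u, s, vh = np.linalg.svd(M)
    rank = int(np.sum(s > tol))
    return rank, vh[rank:].T, u[:, :rank]

n = 5
J = np.zeros((n, n))
J[0, 1] = J[1, 2] = 1.0                       # 3 x 3 nilpotent shift block
J[3, 3], J[3, 4], J[4, 4] = 2.0, 1.0, 3.0     # invertible 2 x 2 block
P = np.array([[1.0, 1.0, 0.0, 0.0, 0.0],
              [0.0, 1.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 1.0, 0.0],
              [0.0, 0.0, 0.0, 1.0, 1.0],
              [1.0, 0.0, 0.0, 0.0, 1.0]])     # invertible change of basis
A = P @ J @ np.linalg.inv(P)

# dim Null(A^k) for k = 1, ..., n: it grows and then stays constant from r_0 on.
nullities, power = [], np.eye(n)
for _ in range(n):
    power = power @ A
    nullities.append(n - rank_and_bases(power)[0])

A_r = np.linalg.matrix_power(A, n)            # r = n is always large enough
_, N, Cb = rank_and_bases(A_r)

print(nullities)
print(N.shape[1], Cb.shape[1], np.linalg.matrix_rank(np.hstack([N, Cb])))
```

The null-space and column-space bases of $A^r$ together span the whole space, and applying `A` to the null-space basis keeps it inside $\operatorname{Null}(A^r)$.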


Let us think of a procedure to find linearly independent vectors in $V_A$. Consider a set of vectors
$\{v_1, \ldots, v_k\}$ recursively generated starting from a vector $v_1 \in V_A$ such that $v_2 = A v_1 \ne 0, \ldots, v_k =
A v_{k-1} = A^{k-1} v_1 \ne 0$ and $A v_k = A^k v_1 = 0$. In order to check their linear independence, assume
$x_1 v_1 + \cdots + x_k v_k = 0$. For $v_i$ with $i \ge 2$, $A^{k-1} v_i = A^{k-1} A^{i-1} v_1 = A^{i-2} (A^k v_1) = 0$ if we multiply
both sides with $A^{k-1}$, leaving only $x_1 A^{k-1} v_1 = x_1 v_k = 0$. $x_1$ is thus 0. We repeat the same procedure
by multiplying both sides of $x_2 v_2 + \cdots + x_k v_k = 0$ with $A^{k-2}$, and we see that $x_2 = 0$. By repeatedly
applying this procedure, we get $x_1 = \cdots = x_k = 0$, and therefore, $B = \{v_1, A v_1, \ldots, A^{k-1} v_1\}$ is linearly
independent.


Lemma 11.1 Assume that a nonzero vector $v \in V_A$ satisfies $A v \ne 0, \ldots, A^{k-1} v \ne 0$ and $A^k v = 0$.
Then, $\{v, A v, \ldots, A^{k-1} v\}$ is linearly independent.


We say that the basic vectors in Lemma 11.1 exhibit a cyclic structure, and this structure plays a
crucial role in decomposing a matrix into a Jordan form later. This lemma also says that the nilpotent
degree $r$ cannot exceed $n$, the dimension of the underlying vector space, $\mathbb{R}^n$ or $\mathbb{C}^n$, and that we
can set $r_0 \le n$ in Theorem 11.2.


Direct Sum Decomposition of the Null Space of Nilpotent Matrices 


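For $W = \operatorname{Null} A^r$, we will analyze a nilpotent matrix $A$ of degree $r$ on $W$. $\operatorname{Null} A^k \subseteq \operatorname{Null} A^{k+1}$ since
$A^k v = 0$ implies $A^{k+1} v = 0$, and thus,
$$\operatorname{Null} A \subset \operatorname{Null} A^2 \subset \cdots \subset \operatorname{Null} A^{r-1} \subset \operatorname{Null} A^r = W. \tag{11.3}$$
By the definition of nilpotency, there exists at least one vector $w \in W$ such that $A^{r-1} w \ne 0$. For this
$w$, $A^{r-k} w \in \operatorname{Null} A^k \setminus \operatorname{Null} A^{k-1}$, and hence the subset inclusions in (11.3) are strict. We now use this
nilpotency structure of $A$ to decompose the subspace $W = \operatorname{Null} A^r$ as direct sums.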


We first extend the notion of linear independence. 


Definition 11.1 Vectors $v_1, \ldots, v_k$ are linearly independent of a subspace $W$ if $a_1 v_1 + \cdots + a_k v_k \in W$
implies $a_1 = \cdots = a_k = 0$.

When $v_1, \ldots, v_k$ are linearly independent of $W$, they are also linearly independent since $0 \in W$. We
use this extended notion of linear independence to form a basis of $\operatorname{Null} A^r$ by finding vectors that are
linearly independent of $\operatorname{Null} A^{k-1}$ and are included in $\operatorname{Null} A^k \setminus \operatorname{Null} A^{k-1}$.


Theorem 11.3 Let $A$ be a square nilpotent matrix of degree $r$ on $\operatorname{Null} A^r$. Then, there exist $m$ and
vectors $v_1, \ldots, v_m \in \operatorname{Null} A^r$ such that the non-zero vectors $A^j v_\ell$ for $j \ge 0$ and $1 \le \ell \le m$, form a basis
for $\operatorname{Null} A^r$. Any vector linearly independent of $\operatorname{Null} A^{r-1}$ can be included in $\{v_1, \ldots, v_m\}$.


Proof: If a matrix $A$ is nilpotent of degree $r$ on $\operatorname{Null} A^r$, then there exists a vector $v$ such that $A^r v = 0$
as well as $A^j v \ne 0$ for $j < r$. The vector $w = A v$ satisfies $A^{r-1} w = A^r v = 0$ and $A^j w = A^{j+1} v \ne 0$ for
$j < r - 1$, which implies that $A$ is nilpotent of degree $r - 1$ on $\operatorname{Null} A^{r-1}$. This observation enables us to
use mathematical induction to prove this theorem:

• When $r = 1$, $\dim \operatorname{Null} A \ge 1$. We simply find a basis of $\operatorname{Null} A$.

• Assume the theorem holds up to $r - 1$. Let $A$ be a nilpotent matrix of degree $r$ on $\operatorname{Null} A^r$,
and $w_1, \ldots, w_k$ be non-zero vectors in $\operatorname{Null} A^r$ that are linearly independent of $\operatorname{Null} A^{r-1}$. These
vectors are obtained maximally such that an addition of any other vector in $\operatorname{Null} A^r \setminus \operatorname{Null} A^{r-1}$ to
$\{w_1, \ldots, w_k\}$ causes linear dependence.¹ Because these vectors $w_i$ are in $\operatorname{Null} A^r$, $A w_i \in \operatorname{Null} A^{r-1}$.
If we suppose $a_1 A w_1 + \cdots + a_k A w_k \in \operatorname{Null} A^{r-2}$, it must be $a_1 = \cdots = a_k = 0$ because the $w_i$'s are
linearly independent of $\operatorname{Null} A^{r-1}$ and $a_1 w_1 + \cdots + a_k w_k \in \operatorname{Null} A^{r-1}$ from $A(a_1 w_1 + \cdots + a_k w_k) \in
\operatorname{Null} A^{r-2}$. Therefore, $A w_1, \ldots, A w_k$ are linearly independent of $\operatorname{Null} A^{r-2}$. Since $A$ is a nilpotent
matrix of degree $r - 1$ on $\operatorname{Null} A^{r-1}$, by the induction hypothesis, there exist $v_1, \ldots, v_{m'}$ in
$\operatorname{Null} A^{r-1}$, including $A w_1, \ldots, A w_k$, such that the non-zero $A^j v_\ell$'s form a basis of $\operatorname{Null} A^{r-1}$. We can then form a
basis of $\operatorname{Null} A^r$ by combining this basis of $\operatorname{Null} A^{r-1}$ and $\{w_1, \ldots, w_k\}$.

This completes the proof for all $r$. ∎
Let us inspect the basis obtained by Theorem 11.3 more carefully. For each $v_\ell$, we define $r_\ell$ to
satisfy $A^{r_\ell} v_\ell = 0$ and $A^{r_\ell - 1} v_\ell \ne 0$. With the $r_\ell$'s, we can rewrite the set of non-zero vectors $\{A^j v_\ell : j \ge 0, 1 \le \ell \le m\}$ as
$\bigcup_{\ell=1}^{m} \{v_\ell, \ldots, A^{r_\ell - 1} v_\ell\}$. If we set $V_\ell = \operatorname{span}\{v_\ell, \ldots, A^{r_\ell - 1} v_\ell\}$, we can decompose $\operatorname{Null} A^r$ as
$$\operatorname{Null} A^r = V_1 \oplus \cdots \oplus V_m, \quad r_\ell = \dim V_\ell. \tag{11.4}$$
Furthermore, each $V_\ell$ is invariant under $A$, since $v \in V_\ell$ implies $A v \in V_\ell$. Even though $V_\ell$ is not determined
uniquely as $v_1, \ldots, v_m$ are not unique, the dimensions of the $V_\ell$'s are uniquely determined up to the order
of the $V_\ell$'s in the direct sum. See Table 11.1 for an example.

Set $d_k = \dim \operatorname{Null} A^k - \dim \operatorname{Null} A^{k-1}$ for $k = 1, \ldots, r$. $d_k \ge 1$ since $\operatorname{Null} A^k \setminus \operatorname{Null} A^{k-1} \ne \emptyset$. Then,
among the basic vectors in a basis $\{A^j v_\ell : j \ge 0, 1 \le \ell \le m\}$ of $\operatorname{Null} A^r$, $d_k$ of them are included in
$\operatorname{Null} A^k \setminus \operatorname{Null} A^{k-1}$. If $A^j v_\ell$ is one of the $d_k$ basic vectors in $\operatorname{Null} A^k \setminus \operatorname{Null} A^{k-1}$, the basis of $V_\ell$ must
contain $A^{k-1+j} v_\ell$, which implies $r_\ell \ge k + j$. The dimension of $V_\ell$ is thus at least $k$. Therefore, we
can see that there are $d_k$-many $V_\ell$'s of dimension at least $k$ in the decomposition in (11.4). It is also
easy to see $d_{k+1} \le d_k$. Then, with $d_{r+1} = 0$, there are $(d_k - d_{k+1})$-many $V_\ell$'s of dimension $k$. The
sequence $(d_1, \ldots, d_r)$ is uniquely determined for a matrix $A$, and so are the number and dimensions of
the subspaces in the decomposition (11.4). We often call $V_\ell$ a cyclic subspace in order to emphasize the
cyclic structure of $\operatorname{span}\{v_\ell, \ldots, A^{r_\ell - 1} v_\ell\}$. We summarize this observation in the following theorem.


Theorem 11.4 Let $A$ be a nilpotent square matrix of degree $r$ on Null $A^r$. Then, there exist a unique
number $m$ and $r_\ell$'s such that

$\operatorname{Null} A^r = V_1 \oplus \cdots \oplus V_m$

where $V_\ell = \operatorname{span}\{v_\ell, \dots, A^{r_\ell - 1} v_\ell\}$. Furthermore, the number of summands $V_\ell$ of dimension at least $k$
is $\dim$ Null $A^k - \dim$ Null $A^{k-1}$. Therefore, the decomposition is unique up to the number of summands
and their dimensions.


¹ As a specific example, we can find $w_1, \dots, w_k$ in Null $A^r$ by applying the Gram-Schmidt procedure on the basis of
Null $A^r$.




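Table 11.1: Demonstration of a Decomposition of Nilpotent Matrix
| $k$ | $V_1$ | $V_2$ | $V_3$ | $V_4$ | $V_5$ | $V_6$ | $V_7$ | $\dim$ Null $A^k$ | $d_k$ |
| 5 | $v_1$ | $v_2$ | | | | | | 25 | 2 |
| 4 | $Av_1$ | $Av_2$ | $v_3$ | $v_4$ | $v_5$ | | | 23 | 5 |
| 3 | $A^2v_1$ | $A^2v_2$ | $Av_3$ | $Av_4$ | $Av_5$ | | | 18 | 5 |
| 2 | $A^3v_1$ | $A^3v_2$ | $A^2v_3$ | $A^2v_4$ | $A^2v_5$ | $v_6$ | | 13 | 6 |
| 1 | $A^4v_1$ | $A^4v_2$ | $A^3v_3$ | $A^3v_4$ | $A^3v_5$ | $Av_6$ | $v_7$ | 7 | 7 |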


Consider Table 11.1, which demonstrates this decomposition for a nilpotent matrix of degree 5. Each column
in this table corresponds to the basis of $V_\ell$, and if we combine the vectors in the bottom $k$ rows, we get a
basis of Null $A^k$. For instance, we form a basis of Null $A^3$ by combining the vectors in the rows of $k = 1, 2,$
and $3$. In particular, the vectors in the row of $k = 1$ are the eigenvectors associated with the eigenvalue 0 of $A$
and constitute a basis of Null $A$.
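
The counting argument behind Theorem 11.4 can be sketched in a few lines of Python. This is only an illustration: the function name `cyclic_subspace_dimensions` and its input format (the list of $\dim$ Null $A^k$ for $k = 1, \dots, r$) are assumptions made here, not notation from the text. Applied to the values in Table 11.1, it recovers the $d_k$ column and the dimensions of the seven cyclic subspaces.

```python
def cyclic_subspace_dimensions(null_dims):
    """Given null_dims[k-1] = dim Null A^k for k = 1..r (hypothetical input
    format), return (d, sizes): d[k-1] = d_k and the sorted list of dim V_l."""
    r = len(null_dims)
    # d_k = dim Null A^k - dim Null A^(k-1), with dim Null A^0 = 0.
    d = [null_dims[0]] + [null_dims[k] - null_dims[k - 1] for k in range(1, r)]
    padded = d + [0]  # convention d_{r+1} = 0
    sizes = []
    for k in range(r, 0, -1):
        # (d_k - d_{k+1}) cyclic subspaces have dimension exactly k.
        sizes.extend([k] * (padded[k - 1] - padded[k]))
    return d, sizes


# Values of dim Null A^k for k = 1..5, read off Table 11.1.
d, sizes = cyclic_subspace_dimensions([7, 13, 18, 23, 25])
print(d)      # d_1, ..., d_5
print(sizes)  # dimensions of V_1, ..., V_7
```

The dimensions sum to $\dim$ Null $A^5 = 25$, as the direct sum (11.4) requires.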

We can apply this result on nilpotency to $A - \lambda I$ where $\lambda$ is an eigenvalue of $A$. In the next section,
we show that $A - \lambda I$ is a nilpotent matrix of degree $r$ on Null $(A - \lambda I)^r$ for some $r \leq n$. We then obtain
a Jordan block using Theorems 11.3 and 11.4.


11.3 Nilpotency of $A - \lambda I$


Generalized Eigenvectors 


Given an eigenvalue $\lambda$ of an $n \times n$ matrix $A$, all the non-zero vectors in Null $(A - \lambda I)$ are its eigenvectors, and the
dimension of this null space is the geometric multiplicity. If the geometric multiplicity is smaller than
the algebraic multiplicity of $\lambda$, we run into an issue when diagonalizing $A$. Before studying it further, we
derive the following corollary by applying Theorem 11.2 to $A - \lambda I$.


Corollary 11.1 Let $A$ be an $n \times n$ complex matrix with an eigenvalue $\lambda$. Then, there exists $r_\lambda$ such that
for any $r \geq r_\lambda$

$\{v \in \mathbb{C}^n : (A - \lambda I)^j v = 0 \text{ for some } j\} = \operatorname{Null}(A - \lambda I)^r$

and

$\mathbb{C}^n = \operatorname{Null}(A - \lambda I)^r \oplus \operatorname{Col}(A - \lambda I)^r,$

where Null $(A - \lambda I)^r$ and Col $(A - \lambda I)^r$ are invariant under $A - cI$ for any scalar $c$.


Note that $r_\lambda \leq n$ since the number of basic vectors in the basis of Null $(A - \lambda I)^{r_\lambda}$ is at most $n$. Therefore,
Null $(A - \lambda I)^{r_\lambda}$ = Null $(A - \lambda I)^n$ holds and we investigate the nilpotent structure of $A - \lambda I$ through
Null $(A - \lambda I)^n$. As a work-around when the geometric multiplicity is smaller than the algebraic multiplicity,
we call the vectors in Null $(A - \lambda I)^n$ generalized eigenvectors, and find a simple matrix similar
to the original matrix by representing it with respect to a new basis consisting of generalized eigenvectors.
Interestingly, the only eigenvalue of $A$ on Null $(A - \lambda I)^n$ is $\lambda$ itself. Let $(\mu, v)$ with $v \in$ Null $(A - \lambda I)^n$
be an eigenpair of $A$. Then, $(A - \lambda I)v = Av - \lambda v = (\mu - \lambda)v$ and $0 = (A - \lambda I)^n v = (\mu - \lambda)^n v$, which
implies $\mu = \lambda$. Furthermore, the generalized eigenvectors corresponding to different eigenvalues are linearly
independent, just as the eigenvectors corresponding to different eigenvalues are. In other words, for
different eigenvalues $\lambda_1 \neq \lambda_2$,
$\operatorname{Null}(A - \lambda_1 I)^n \cap \operatorname{Null}(A - \lambda_2 I)^n = \{0\}$. (11.5)


Suppose that $0 \neq v \in$ Null $(A - \lambda_1 I)^n \cap$ Null $(A - \lambda_2 I)^n$. Then, there exists $k \geq 0$ such that $(A - \lambda_2 I)^{k+1} v =
0$ and $(A - \lambda_2 I)^k v \neq 0$ since $v \in$ Null $(A - \lambda_2 I)^n$. Set $w = (A - \lambda_2 I)^k v \neq 0$, so that $Aw = \lambda_2 w$. Since
$A - \lambda_1 I$ and $A - \lambda_2 I$ commute,

$(A - \lambda_1 I)^n w = (A - \lambda_1 I)^n (A - \lambda_2 I)^k v = (A - \lambda_2 I)^k (A - \lambda_1 I)^n v = 0,$

which implies $w \in$ Null $(A - \lambda_1 I)^n$. Therefore $(\lambda_2, w)$ is an eigenpair of $A$ on Null $(A - \lambda_1 I)^n$, which
contradicts the uniqueness of the eigenvalue of $A$ on Null $(A - \lambda_1 I)^n$, and (11.5) holds.

An $n \times n$ complex matrix $A$ has $n$ eigenvalues, including multiple roots, and is similar to an upper
triangular matrix with the eigenvalues on its diagonal, according to Theorem 10.5. That is, $A = QUQ^{\mathsf{H}}$
for some upper triangular matrix $U$ and unitary matrix $Q$. Assume that there are $m$ distinct eigenvalues, $\lambda_1, \dots, \lambda_m$,
and that equal eigenvalues are located adjacently on the diagonal of $U$. If the algebraic multiplicity of $\lambda_i$
is $k_i$, the $k_i$ diagonal entries in the $i$-th diagonal block of $U$ are all $\lambda_i$. Hence, the $i$-th diagonal block
of $U - \lambda_i I$ is a $(k_i \times k_i)$ upper-triangular matrix with all zero diagonal entries. In addition, for any $k$,
$(A - \lambda_i I)^k = Q(U - \lambda_i I)^k Q^{\mathsf{H}}$ since

$A - \lambda_i I = QUQ^{\mathsf{H}} - \lambda_i QQ^{\mathsf{H}} = Q(U - \lambda_i I)Q^{\mathsf{H}}.$
According to Fact 2.1, the entries of the $i$-th $(k_i \times k_i)$ diagonal block of $(U - \lambda_i I)^{k_i}$ are all zeros and
the remaining diagonal entries of $(U - \lambda_i I)^{k_i}$ are $(\lambda_j - \lambda_i)^{k_i} \neq 0$. The rank of $(U - \lambda_i I)^{k_i}$ is thus
$n - k_i$, and so is that of $(A - \lambda_i I)^{k_i}$, since $Q$ is invertible. This results in $\dim$ Null $(A - \lambda_i I)^{k_i} =
\dim$ Null $(U - \lambda_i I)^{k_i} = k_i$. Since the $k_i$'s are the multiplicities of multiple roots of the $n$-th order characteristic
equation and Null $(A - \lambda_i I)^{k_i} \subset$ Null $(A - \lambda_i I)^n$ holds,

$\dim(\mathbb{C}^n) = n = k_1 + \cdots + k_m$
$\quad = \dim \operatorname{Null}(A - \lambda_1 I)^{k_1} + \cdots + \dim \operatorname{Null}(A - \lambda_m I)^{k_m}$
$\quad \leq \dim \operatorname{Null}(A - \lambda_1 I)^n + \cdots + \dim \operatorname{Null}(A - \lambda_m I)^n.$

On the other hand, each Null $(A - \lambda_i I)^n \subset \mathbb{C}^n$ implies

$\mathbb{C}^n \supset \operatorname{Null}(A - \lambda_1 I)^n + \cdots + \operatorname{Null}(A - \lambda_m I)^n.$




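Combined with (11.5), we get an inclusion in terms of direct sums as
$\mathbb{C}^n \supset \operatorname{Null}(A - \lambda_1 I)^n \oplus \cdots \oplus \operatorname{Null}(A - \lambda_m I)^n,$
which implies $n \geq \dim$ Null $(A - \lambda_1 I)^n + \cdots + \dim$ Null $(A - \lambda_m I)^n$. Combining these inequalities, we get
$n = k_1 + \cdots + k_m \leq \dim \operatorname{Null}(A - \lambda_1 I)^n + \cdots + \dim \operatorname{Null}(A - \lambda_m I)^n \leq n.$

Then, $k_i \leq \dim$ Null $(A - \lambda_i I)^n$ implies $k_i = \dim$ Null $(A - \lambda_i I)^n$, and the above set inclusion is in fact
an equality. That is, we can decompose $\mathbb{C}^n$ as
$\mathbb{C}^n = \operatorname{Null}(A - \lambda_1 I)^n \oplus \cdots \oplus \operatorname{Null}(A - \lambda_m I)^n.$
We summarize this observation in the following theorem.
Theorem 11.5 Let $A$ be an $n \times n$ matrix, and $\lambda_1, \dots, \lambda_m$ its distinct eigenvalues. Then,
$\mathbb{C}^n = \operatorname{Null}(A - \lambda_1 I)^n \oplus \cdots \oplus \operatorname{Null}(A - \lambda_m I)^n$. (11.6)


Once we obtain a basis of each subspace Null $(A - \lambda_i I)^n$, we form a basis of $\mathbb{C}^n$ by combining them.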


11.4 The Second Big Theorem: the Jordan Normal Form Theorem


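Assume $\lambda_i$ is an eigenvalue of an $n \times n$ complex matrix $A$ with its algebraic multiplicity $k_i$. From Corollary
11.1, $A - \lambda_i I$ is a nilpotent matrix on Null $(A - \lambda_i I)^n$, which allows us to apply Theorems 11.3 and 11.4
to the nilpotent matrix $A - \lambda_i I$. Null $(A - \lambda_i I)^n$ is decomposed as the direct sum of the subspaces with
bases $\{v_\ell, \dots, (A - \lambda_i I)^{r_\ell - 1} v_\ell\}$, according to Theorem 11.3. With $w_k = (A - \lambda_i I)^{k-1} v_\ell$, the basis is
$\{w_1, \dots, w_{r_\ell}\}$, and the basic vectors are related to each other by
$w_{k+1} = (A - \lambda_i I) w_k$ for $k = 1, \dots, r_\ell - 1$, $\quad (A - \lambda_i I) w_{r_\ell} = 0.$
We can rewrite it as
$A w_k = \lambda_i w_k + w_{k+1}$ for $k = 1, \dots, r_\ell - 1$, $\quad A w_{r_\ell} = \lambda_i w_{r_\ell}.$


Let us define a Jordan block $J_\ell$ as the following $r_\ell \times r_\ell$ matrix:

$$J_\ell = \begin{bmatrix} \lambda_i & 1 & 0 & \cdots & 0 \\ 0 & \lambda_i & 1 & & \vdots \\ \vdots & & \ddots & \ddots & 0 \\ & & & \lambda_i & 1 \\ 0 & \cdots & & 0 & \lambda_i \end{bmatrix} = \lambda_i I + [\, 0 \mid e_1 \mid \cdots \mid e_{r_\ell - 1} \,].$$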



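When $W_\ell = [\, w_{r_\ell} \mid \cdots \mid w_1 \,]$,²
$A W_\ell = W_\ell J_\ell$. (11.7)


In other words, the linear transformation defined by $A$ in a subspace with a basis $\{w_1, \dots, w_{r_\ell}\}$ can be
expressed using a Jordan block $J_\ell$. For each $v_\ell$, there exists a Jordan block $J_\ell$ that satisfies (11.7). Hence,
for $W_{\lambda_i} = [\, W_1 \mid \cdots \mid W_{m_i} \,]$ and $J_{\lambda_i} = \operatorname{diag}(J_1, \dots, J_{m_i})$, where $m_i$ is the number of cyclic subspaces for $\lambda_i$,


$A W_{\lambda_i} = W_{\lambda_i} J_{\lambda_i}$, (11.8)


where $W_{\lambda_i}$ and $J_{\lambda_i}$ are $n \times k_i$ and $k_i \times k_i$ matrices, respectively.

The sizes and number of Jordan blocks that constitute $J_{\lambda_i}$ are uniquely determined by Theorem 11.4.
There exist $W_{\lambda_i}$ and $J_{\lambda_i}$ that satisfy (11.8) for each eigenvalue $\lambda_i$ of $A$. If we let $W = [\, W_{\lambda_1} \mid \cdots \mid W_{\lambda_m} \,]$
and $J = \operatorname{diag}(J_{\lambda_1}, \dots, J_{\lambda_m})$,

$AW = WJ$, (11.9)


and we call $J$ a Jordan (normal) form. Since $W$, which takes basic vectors as its columns, is invertible,
$J = W^{-1}AW$, meaning that $A$ and $J$ are similar.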
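
The construction can be traced on a small case. The sketch below is a hypothetical $2 \times 2$ example chosen here for illustration (the matrix `A`, the starting vector `v`, and the helper `matmul` are not from the text): $A$ has the single eigenvalue $\lambda = 2$ with geometric multiplicity 1, so one Jordan block of size 2 is expected. We build $w_1 = v$ and $w_2 = (A - \lambda I)w_1$, place them in flipped order as in (11.7), and check $AW = WJ$.

```python
def matmul(X, Y):
    """Plain matrix product of two lists of rows."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]


# Illustrative matrix with characteristic polynomial (t - 2)^2.
A = [[3, 1],
     [-1, 1]]
lam = 2

# N = A - lam I is nilpotent of degree 2 on the whole space.
N = [[A[i][j] - (lam if i == j else 0) for j in range(2)] for i in range(2)]

# Cyclic basis: w1 = v, w2 = N w1, with v chosen outside Null N.
v = [1, 0]
w1 = v
w2 = [sum(N[i][k] * w1[k] for k in range(2)) for i in range(2)]

# Columns in flipped order [w2 | w1], matching the convention of (11.7).
W = [[w2[0], w1[0]],
     [w2[1], w1[1]]]
J = [[lam, 1],
     [0, lam]]

AW = matmul(A, W)
WJ = matmul(W, J)
print(AW, WJ)
```

Both products coincide, which is exactly the statement $AW = WJ$ for this example.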


We summarize this result as the following theorem. 


Theorem 11.6 (Jordan Normal Form Theorem) Any n x n matrix is similar to a Jordan normal 


form. The Jordan form is unique up to the number and sizes of Jordan blocks. 


A Jordan form $J$ is not a diagonal matrix but closely resembles one. Using this special structure, we
can often compute $J^k$ more efficiently than $A^k$, by replacing the diagonal matrix in power iteration, from
Section 9.7, with such a Jordan block. This is less efficient than using a diagonal matrix but is often more
efficient than using the original matrix directly.
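
One way to see why powers of a Jordan block are cheap: writing $J_\ell = \lambda I + N$ with $N$ the shift matrix, the binomial theorem gives $(J_\ell^k)_{ab} = \binom{k}{b-a} \lambda^{k-(b-a)}$ for $0 \leq b - a \leq k$ and zero otherwise. The Python sketch below, with function names chosen here for illustration, fills in the entries from this closed form and compares the result with repeated multiplication.

```python
from math import comb


def jordan_block(lam, size):
    """Upper-triangular Jordan block with lam on the diagonal and 1 above it."""
    return [[lam if i == j else (1 if j == i + 1 else 0) for j in range(size)]
            for i in range(size)]


def jordan_block_power(lam, size, k):
    """k-th power of a Jordan block from the binomial closed form."""
    P = [[0] * size for _ in range(size)]
    for a in range(size):
        for b in range(a, size):
            offset = b - a
            if offset <= k:
                P[a][b] = comb(k, offset) * lam ** (k - offset)
    return P


def matmul(X, Y):
    """Plain matrix product, used only to cross-check the closed form."""
    return [[sum(X[i][m] * Y[m][j] for m in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]


J = jordan_block(2, 3)
by_multiplication = J
for _ in range(3):  # J^4 needs three further multiplications
    by_multiplication = matmul(by_multiplication, J)

closed_form = jordan_block_power(2, 3, 4)
print(closed_form)
```

The closed form needs only the entries of one row pattern, whereas repeated multiplication of a general $n \times n$ matrix costs a full matrix product per step.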


2Be careful with the column indices, as they are flipped, because it is a convention to define a Jordan block as an 


upper-triangular matrix with the super-diagonal set to 1. 




Chapter 12 


Homework Assignments 


Chapter 2 


1. Consider the following simultaneous linear equations in 3 unknowns, $x = (u, v, w)^\top$:

$$\begin{aligned} u - 4v + 7w &= -9 \\ 2u - 6v + 9w &= -10 \\ u - 2v + 5w &= -7 \end{aligned}$$


(a) Convert the equations in matrix-vector form as Ax = b. What are A and b? 


(b) Decompose A in the form of LDU where L and U are lower and upper triangular with unit 


diagonals, and D is a diagonal matrix. What are L, D, and U ? 


(c) Find $x$ by solving $DUx = L^{-1}b$.
(d) Compute $A^{-1}$.


2. Repeat 1 with a randomly generated $10 \times 10$ matrix $A$ and $b \in \mathbb{R}^{10}$.


3. Prove that matrix multiplication is associative (AB)C = A(BC) and distributive A(B + C) = 


AB + AC, (B+C)D = BD +CD. 


4. Find an example showing that matrix multiplication is not commutative, that is, find two matrices
$A$ and $B$ such that $AB \neq BA$.


5. Show that $A I_n = A = I_m A$ for any $m \times n$ matrix $A$.


6. Show that $A^\top D A$ is symmetric where $A$ is $m \times n$ and $D$ is an $m \times m$ diagonal matrix.


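7. Show the following multiplication rule for two block matrices:

$$AB = \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix} \begin{bmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{bmatrix} = \begin{bmatrix} A_{11}B_{11} + A_{12}B_{21} & A_{11}B_{12} + A_{12}B_{22} \\ A_{21}B_{11} + A_{22}B_{21} & A_{21}B_{12} + A_{22}B_{22} \end{bmatrix}$$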


8. Show that the product of two lower triangular matrices is also lower triangular. 


9. Compute $A^2$, $A^3$, and $A^4$ where

$$A = \begin{bmatrix} 0 & 1 & 2 & 3 \\ 0 & 0 & 1 & -1 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 0 \end{bmatrix}$$


Chapter 3 


1. Show that $\{0\}$, $\{cx : c \in \mathbb{R}\}$ for $x \in V$, and $\{c_1 x_1 + \cdots + c_n x_n : c_1, \dots, c_n \in \mathbb{R}\}$ for $x_1, \dots, x_n \in V$
are subspaces of a vector space $V$.
2. Show that $\{(x, y) : x \geq 0, y \geq 0\}$ is not a subspace of $\mathbb{R}^2$.


3. Show that the set of all n x n lower triangular matrices is a vector space and a subspace of the 
vector space consisting of all n x n matrices. Do the same work for the set of all n x n symmetric 


matrices. 


4. Let U be the row echelon form of A. Show that the vectors, in the null space of U, obtained by 
setting 1 for a single free variable and other free variables as zeros constitute a basis for the null 


space of A. 


5. Consider the following $4 \times 5$ matrix $A$:

$$A = \begin{bmatrix} 2 & 2 & 1 & -6 & 4 \\ 4 & 4 & 1 & 10 & 13 \\ 6 & 6 & 0 & 20 & 19 \\ 8 & 8 & 1 & 14 & 23 \end{bmatrix}$$

(a) Compute the row echelon form and reduced row echelon form of $A$.
(b) What is the rank of $A$?
(c) Characterize the null space of $A$.
(d) What is the dimension of Null $(A)$?
(e) Find a maximally independent set of column vectors of $A$.
(f) What is the dimension of Col $(A)$?
6. Let a set $V$ include all polynomials of degree $n$ or less, $V = \{a_0 + a_1 t + \cdots + a_n t^n : a_0, a_1, \dots, a_n \in \mathbb{R}\}$.
Let $T$ be a transform that maps a polynomial $f(t)$ to the polynomial $\int_0^t f(s)\,ds$. Note that $n$ is
a fixed integer in this question.


(a) Show that V is a vector space over the multiplication scalar field R. 




(b) What is the dimension of V? 

(c) To make T be a map from V into W, what should be W if W is a vector space? 

(d) Is T a linear map? 

(e) Find bases $B_V$ and $B_W$ of $V$ and $W$.

(f) Find the transform matrix $A$ representing $T$ with respect to the bases $B_V$ and $B_W$ chosen
above.


7. In a finite-dimensional vector space, show that any linearly independent set of vectors can be 


extended to a basis. 


8. Let $V$ be a vector space of dimension $n$, and let $W_1$ and $W_2$ be two subspaces of $V$. Suppose that
$\dim W_1 = n_1$, $\dim W_2 = n_2$, and $n_1 + n_2 > n$. Show that $\dim(W_1 \cap W_2) \geq n_1 + n_2 - n$.


Chapter 4 


1. Let a vector space V include all polynomials of degrees 2 or less, V = {ag +a t+agt? : ap, a1, a2 € R}. 


For two polynomials f(t) and g(t) in V, we define an inner product 


(£9) = 4 f(t)g(t)dt. 


(a) Compute |f], |g], and |f — g| for f(t) = t and g(t) = t°. 
(b) Show that {1,t,¢?} is linearly independent. 
(c) For W = span{1,t?}, compute Pw/(t). 
(d) Show that B = {1,t,t?} is a basis for V. 
(e) Find a matrix representation of the inner product (f,g) with respect to the basis B. 
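2. Consider the Euclidean vector space $\mathbb{R}^4$. For two vectors $x = (x_1, x_2, x_3, x_4)^\top$ and $y = (y_1, y_2, y_3, y_4)^\top$
in $\mathbb{R}^4$, we define a non-standard inner product

$$\langle x, y \rangle = 2x_1y_1 + \tfrac{2}{3}(x_1y_3 + x_2y_2 + x_3y_1) + \tfrac{2}{5}(x_2y_4 + x_3y_3 + x_4y_2) + \tfrac{2}{7}x_4y_4,$$

or, in matrix form as

$$\langle x, y \rangle = [x_1, x_2, x_3, x_4] \begin{bmatrix} 2 & 0 & \frac{2}{3} & 0 \\ 0 & \frac{2}{3} & 0 & \frac{2}{5} \\ \frac{2}{3} & 0 & \frac{2}{5} & 0 \\ 0 & \frac{2}{5} & 0 & \frac{2}{7} \end{bmatrix} \begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \end{bmatrix}.$$

(a) Compute $\|x\|$, $\|y\|$, and $\|x - y\|$ for $x = (1, 0, 0, 0)^\top$ and $y = (0, 0, 1, 0)^\top$.


(b) Find an orthonormal basis of $\operatorname{span}\{(1, 0, 0, 0)^\top, (0, 0, 1, 0)^\top\}$.


(c) Let $W = \operatorname{span}\{(1, 0, 0, 0)^\top, (0, 0, 1, 0)^\top\}$. Find an orthonormal basis of $W^\perp$.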


3. Let a vector space $V$ include all polynomials of degree 3 or less, $V = \{a_0 + a_1 t + a_2 t^2 + a_3 t^3 :
a_0, a_1, a_2, a_3 \in \mathbb{R}\}$. For two polynomials $f(t), g(t) \in V$, we define an inner product

$$\langle f, g \rangle = \int_{-1}^{1} f(t) g(t)\,dt.$$

(a) Compute $\|f\|$, $\|g\|$, and $\|f - g\|$ for $f(t) = 1$ and $g(t) = t^2$.
(b) Find an orthonormal basis of $\operatorname{span}\{1, t^2\}$.
(c) Let $W = \operatorname{span}\{1, t^2\}$. Find an orthonormal basis of $W^\perp$.


(d) Find the matrix representation of the inner product $\langle f, g \rangle$ with respect to the basis $\{1, t, t^2, t^3\}$
of $V$.


4. Compare 2 and 3. 


Chapter 5 


1. Find singular values and singular vectors of the following rotation matrix:

$$\begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix}$$


2. Let a 4 x 4 matrix A is given as 


lo o 2 of -3 


O O m.e me 


(a) Find singular values and right-/left-singular vectors of A. 
(b) Find || Allz and || All. 

(c) Find the pseudoinverse At of A. 

(d) Find a 4 x 4 matrix B of rank 2 that minimizes ||A — B||2. 


(e) Can you find eigenvalues and eigenvectors of A by hand calculation? 


3. Let a 4 x 4 matrix A is given as 


1 


Jo o 2 of -3 h 10 of -2/7 [i 1 0 o; 


oF Oo oO 
O O m.e me 


0 




(a) Find singular values and right-/left-singular vectors of A. 
(b) Can you find eigenvalues and eigenvectors of A by hand calculation? 
(c) Find rank A. 
4. Suppose $A$ is a $3 \times 4$ matrix given as $A = \begin{bmatrix} 2 & 1 & -1 & 1 \\ 6 & 3 & -3 & 3 \\ -2 & -1 & 1 & -1 \end{bmatrix}$.
(a) Find eigenvalues and eigenvectors of $A^\top A$.
(b) Find eigenvalues and eigenvectors of $AA^\top$.


(c) Is there a $3 \times 4$ matrix $B$ of rank 2 that minimizes $\|A - B\|_2$?


Chapter 7 


1. Let A and B be two n x n positive definite matrices. Consider the following sum of two quadratic 


forms 


$f(x) = (x - a)^\top A (x - a) + (x - b)^\top B (x - b)$


where $a, b \in \mathbb{R}^n$. Re-formulate the sum as a single quadratic form: that is, find $C$, $d$, and $r$ such
that
$f(x) = (x - d)^\top C (x - d) + r.$


2. Let $A$ be an $n \times n$ symmetric positive definite matrix. Denote its eigenvalues as $\lambda_1, \dots, \lambda_n$. Let $B$
be the square root of $A$, that is, $A = B^2$.
(a) Let $A = U^\top U$ be the Cholesky factorization of $A$. Find $|\det U|$.
(b) Find the eigenvalues of $B$.
(c) Let $B = QR$ be the QR-decomposition of $B$. Find $|\det R|$.
(d) Does there exist the square root $C$ of $B$? If yes, what are the eigenvalues of $C$?

3. Let $A$ be an $n \times n$ symmetric positive definite matrix whose spectral decomposition is described as
$A = V \Lambda V^\top$ where $V$ is an orthogonal matrix and $\Lambda = \operatorname{diag}(\lambda_1, \dots, \lambda_n)$ for $\lambda_i > 0$. We also define
$\Lambda^{1/k} = \operatorname{diag}(\lambda_1^{1/k}, \dots, \lambda_n^{1/k})$.

(a) Characterize a symmetric positive definite matrix $B_2$ satisfying $A = B_2^2$ in terms of $V$ and $\Lambda$.


(b) Characterize a symmetric positive definite matrix $B_k$ satisfying $A = B_k^k$ in terms of $V$ and $\Lambda$
for every positive integer $k$.


(c) What would be $\lim_{k \to \infty} B_k$? Validate your answer as reasonably as possible.




Chapter 8 


1. Compute the determinants of A, U, U', U~!, and M where 


i 4 4 8 7 0 0 0 8 
012 2 0 0 2 6 
A=|5||3 -2 a], v= , M= 
5 002 6 012 2 
0 0 0 8 4 4 8 7 
2. Let A be an n x n tridiagonal matrix of 
1 -1 
1 1 -1 
1 1 -1 
A= 
1 1-1 
1 1 


Find det A. 


3. Find the determinant of the following Vandermonde matrix: 


2 


V3=]1 b BP 


1 ce Ê 


4. Find the determinant of the following rotation matrix: 


sinf coso sin@sing cosé 
cos@cosd cos@sing —sin#é 
— sin cos @ 0 


Chapter 9 


1. Show that the set of eigenvectors associated with a single eigenvalue is a subspace without the 


origin. 


2. Find all eigenvalues of A, B, U, U~', and T where 


1 1 448 1 -1 0 
e a eS 2) ao Soe ey a ae a 
1 1 002 01 1 


3. Find all eigenvectors of A in 2. 


4. Find all eigenvalues and eigenvectors of the following rotation matrix:

$$\begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix}$$


Chapter 13 


Problems 


13.1 Problems for Chapter 1 ~ 4 


Problem Set 1 
1. Suppose A is a 3 x 4 matrix and b is a vector in R3. 


(a) Give an example of the matrix $A$ with rank 2. No component of $A$ is allowed to be zero and
$PA$ admits an LU decomposition with $P = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 0 & 1 \\ 0 & 1 & 0 \end{bmatrix}$ whereas $A$ itself does not allow LU
decomposition.


(b) For the matrix A in (a), give two examples of b, one of which allows a solution of Ax = b and 


the other one doesn’t. 


(c) Find two vectors in Col (A) closest to the two vectors in (b), respectively. 
2. Suppose A and B are n x m and m x n matrices, respectively. 


(a) Recalling that each column of $AB$ is a linear combination of the columns of $A$, prove that
$\operatorname{rank}(AB) \leq \operatorname{rank}(A)$. Can we conclude $\operatorname{rank}(AB) \leq \operatorname{rank}(B)$ by straightforwardly applying
the above statement? If yes, why?

(b) Assume $AB = I_n$ where $I_n$ is the identity matrix. What is the rank of $A$? If $m = n$, what is
$BA$?

(c) Assume $AB = I_n$ where $I_n$ is the identity matrix and $m > n$. Does $Ax = b$ have a solution for
any $b \in \mathbb{R}^n$? Does $By = c$ have a solution for any $c \in \mathbb{R}^m$?


3. Let a vector space $V$ include all polynomials of degree 3 or less, $V = \{a_0 + a_1 t + a_2 t^2 + a_3 t^3 :
a_0, a_1, a_2, a_3 \in \mathbb{R}\}$. Let $T$ be a transformation that maps a polynomial $f(t)$ to the derivative $f'(t)$.


(a) To make $T$ a map from $V$ into $W$, what should be $W$ if $W$ is a vector space?
(b) Find bases $B_V$ and $B_W$ of $V$ and $W$.


(c) Find the matrix $A$ representing the transform $T$ with respect to the bases $B_V$ and $B_W$ chosen
above.


4. Let a vector space V include all polynomials of degree 3 or less, V = {ag + aıt + aot? + azt? : 


ao, @1, 42, a3 E€ R}. For two polynomials f(t), g(t) € V, we define an inner product 


(f.9) = J SODA. 


(a) Compute |f], |g|, and |f — g| for f(t) = t and g(t) = t°. 
(b) Find an orthonormal basis of span{1, t?}. 
(c) Let W = span{t?}. Find a basis of W+. 


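5. Consider the Euclidean vector space $\mathbb{R}^4$. For two vectors $x = (x_1, x_2, x_3, x_4)^\top$ and $y = (y_1, y_2, y_3, y_4)^\top$
in $\mathbb{R}^4$, we define a non-standard inner product

$$\langle x, y \rangle = 2x_1y_1 + \tfrac{2}{3}(x_1y_3 + x_2y_2 + x_3y_1) + \tfrac{2}{5}(x_2y_4 + x_3y_3 + x_4y_2) + \tfrac{2}{7}x_4y_4,$$

or, in matrix form as

$$\langle x, y \rangle = [x_1, x_2, x_3, x_4] \begin{bmatrix} 2 & 0 & \frac{2}{3} & 0 \\ 0 & \frac{2}{3} & 0 & \frac{2}{5} \\ \frac{2}{3} & 0 & \frac{2}{5} & 0 \\ 0 & \frac{2}{5} & 0 & \frac{2}{7} \end{bmatrix} \begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \end{bmatrix}.$$

(a) Compute $\|x\|$, $\|y\|$, and $\|x - y\|$ for $x = (0, 1, 0, 0)^\top$ and $y = (0, 0, 1, 0)^\top$.


(b) Find an orthonormal basis of $\operatorname{span}\{(1, 0, 0, 0)^\top, (0, 0, 1, 0)^\top\}$.
(c) Let $W = \operatorname{span}\{(0, 0, 1, 0)^\top\}$. Find a basis of $W^\perp$.
6. (a) Let a subspace $W$ of $\mathbb{R}^3$ be spanned by two vectors $(-1, 0, 1)^\top$ and $(1, 1, 0)^\top$. Find the projection
of $(1, 1, 1)^\top$ onto $W$.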


(b) Consider a linear model z = ax+by. We have three observations of (x,y,z): (—1, 1,1), (0,1,1), 


(1,0,1). Compute the least square estimates of a and b. 


(c) Compare the above two questions. 
7. True or False. No need to explain your guesses. 


(a) For an m x n (m > n) matrix, the number of linearly independent rows equals the number of 


linearly independent columns. 




(b) For an n x n matrix A, a map T from the vector space of n x n matrix onto itself, defined as 


T(X) = AX — XA is a linear transformation. 
(c) Suppose that a square matrix A is invertible. Then, the following block matrix B is invertible, 


and the inverse is given as 


A 0 7 At 0 
B= and BUT = 
C I -CA I 


where 0 is a matrix with 0’s and J is an identity matrix in appropriate sizes, respectively. 
(d) A set of linearly independent vectors is orthogonal. 


(e) For a square matrix A, dim Null (A) = dim Null (A! ). 


Problem Set 2 
1. Suppose A is a 3 x 4 matrix and b is a vector in R3. 


(a) Give an example of the matrix $A$ with rank 2. No component of $A$ is allowed to be zero and
$PA$ admits an LU decomposition with $P = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 0 & 1 \\ 0 & 1 & 0 \end{bmatrix}$ whereas $A$ itself does not allow LU
decomposition.


(b) For the matrix A in (a), give two examples of b, one of which allows a solution of Ax = b and 


the other one doesn’t. 
(c) Find vectors in Col (A) which are closest to the two vectors in (b), respectively. 


2. Assume that u1,..., up and v1,...,v,% are sets of orthonormal vectors in R”, respectively. Find the 


rank of n x n matrix 
k 
> jujvj - 
j=l 


3. Let a vector space V include all functions in the form of f(t) = aye’ + age 7" + age™ where 
a1, 42,a3 € R, that is, V = {aet + age” + aze~** : ay, a2,a3 € R}. For two functions f(t) and 


g(t) in V, we define an inner product 


o= | FOHA. 
0 
(a) Check that V is a vector space over the scalar R. 
(b) Compute |f], |g|, and |f — g| for f(t) = e™* and g(t) =e~*". 
(c) Show that Bı = {e~,e~ 7, e~3"} is linearly independent. 


(d) Find the matrix representation of the inner product (-,-) with respect to the basis By. 




(e) For W = span{e~‘, e~*4}, compute Pw(e~”*). 
(f) Find an orthonormal basis By for Y. 
(g) Find the matrix representation of the inner product (-,-) with respect to the basis Bo. 


(h) The density level of some chemical at time t is described by a,e~* + age~?" + age™’t + e for 
some fixed a,,a2,a3 E€ R. At n time points 0 < ty < tg <--- < tn, the observed density 
levels are y1, Y2,---, Yn. For €i = yi — (aye + aze? + age), characterize G1, G2, @3 that 
minimizes $ ;_] €?. 


4. Let V be a vector space and T be a linear transform representing a projection. 


(a) Show that W = {T(v) : v € V} be a subspace of V. 

(b) Show that I — T is also a projection where (I — T)(v) =v — T (v). 
(c) What is the linear transform T o (I — T)? 

(d) Describe W+ in terms of T. 


5. True or False. No need to explain your guesses. 


(a) For an n x n matrix A, I — A has rank n — k if rank A = k. 
(b) For an n x n projection matrix P, I — P has rank n — k if rank P = k. 


(c) For an n x n matrix A, a map T from the vector space of n x n matrix onto itself, defined as 


T(X) = A! X — X'A is a linear transformation. 


(d) Suppose that U and V are n x k matrices. Then 


I, 0| |2,+UV' U In 0 In U 
V! k 0 I,| |-V' k 0 &-V'U 


(e) A set of orthogonal vectors is linearly independent. 


13.2 Problems for Chapter 5 ~ 9 


Problem Set 1 


1. A 3 x 3 square matrix has a QR-type decomposition (not an exact QR-decomposition) as A = 


1 -1 0 21 0 
tye ot 01 0 
1 1/2 1 0 0 =i 


(a) Find the Q and R factors in the QR-decomposition of A. 


(b) Compute the volume of a parallelopiped whose six faces are parallelograms formed by the 


column vectors of A. 




(c) Compute A~!. You may describe the inverse in a decomposed form. 


2. Suppose A isa 3x4 matrix. We computed its singular value decomposition (SVD) using a computer 
program. Unfortunately, there was a problem with the screen, and we could not recognize some 
figures of the SVD results. We have U, V, and © such that A = USV' where 

1/2 1/v2 0 


1/V2 1/v2 0 3 0 > 
1/2 0 * dues 
U= 0 * * |,V= ,u=]0 2 x |, and * means missing 
1/2 * * 
-1/J2 x x 0O *« 1 


figure on the screen. 


(a) Fill out V. 
(b) Find the largest eigenvalue and corresponding eigenvectors of A! A. 
(c) Find a rank 2 matrix B that minimizes ||A — B||2. 
2 1 -1 #1 
3. Suppose A is a 3 x 4 matrix given as A = 4 2 -2 2 
—2 -1 1 -1 
(a) Find a singular value decomposition of A. 
(b) Find the pseudoinverse At of A. 


(c) Find || Allo. 


4. Assume an m x n matrix A = [aj,...,a,] has non-zero columns a; E€ R”, i = 1,...,n that are 
T 


orthogonal to each other: a; a; = 0 for i # j. Find an SVD for A, in terms of a;’s. Be as explicit 


as you can. 


5. Assume an m x n matrix A with m > n, have singular values o1,...,0n. Set an (m+n) x n matrix 
o A 


A= 
In 
(a) Find the singular values of A. 
(b) Find an SVD of the matrix A in terms of the SVD for A. 
6. A= X`; viv] where v;’s are orthonormal, A; > 0, and Ay > Az >... > An: 
(a) Compute det A. 
(b) Find all eigenvalues and corresponding eigenvectors of A. 
(c) For A to be positive semidefinite, what conditions on A; do we need? 
(d) If A is positive semidefinite, characterize an SVD of A. 


(e) If A is positive semidefinite, characterize the pseudoinverse AT of A. 


(f) If $A$ is positive definite, characterize the inverse $A^{-1}$ of $A$.


7. Let $A$ be an $n \times n$ symmetric positive definite matrix whose spectral decomposition is described as
$A = V \Lambda V^\top$ where $V$ is an orthogonal matrix and $\Lambda = \operatorname{diag}(\lambda_1, \dots, \lambda_n)$ for $\lambda_i > 0$. We also define
$\Lambda^{1/k} = \operatorname{diag}(\lambda_1^{1/k}, \dots, \lambda_n^{1/k})$.

(a) Characterize a symmetric positive definite matrix $B_2$ satisfying $A = B_2^2$ in terms of $V$ and $\Lambda$.


(b) Characterize a symmetric positive definite matrix $B_k$ satisfying $A = B_k^k$ in terms of $V$ and $\Lambda$
for every positive integer $k$.


(c) What would be $\lim_{k \to \infty} B_k$? Validate your answer as reasonably as possible.


8. True or False. No need to explain your guesses. 


(a) For an orthogonal matrix Q, det Q = 1. 
(b) For n x n symmetric matrices A and B, \,(A+ B) > A,(A) if B is positive definite. 


B 0 
(c) Suppose that B and C are square matrices. Then, the following block matrix A = 
0 C 


is positive definite if and only if both B and C are positive definite. 
(d) For an m x n matrix A, At = (A! A)7!A if rank(A) = m. 


(e) Every projection matrix has 0 as its determinant. 


Problem Set 2 


1. Let a matrix A be m x n and a matrix B be n x m. Consider a block matrix 


Find det M. 


2. Assume that a matrix B has eigenvalues 1, 2, 3, a matrix C has eigenvalues 4, 0, -4, and a matrix 
C 

D has eigenvalues -1, -2, -3. Set a 6 x 6 matrix A = where B,C, and D are 3 x 3 matrices. 
0 D 


We also define a linear transformation £ : R — R® as L(x) = Ax for x € R. 


(a) What are the eigenvalues of the 6 x 6 matrix A? 


(b) Consider the unit cube Q in R®, that is, Q = {(£1,..., £6)! :0 < a < 1,i = 1,...,6} C R. 
Find the volume of £(Q) = {é&(x): x € Q}. 




3. Let a 4 x 4 matrix A is given as 


Jo o 2 o] -3 


O =. O © 
O O m.e me 


(a) Find eigenvalues and eigenvectors of A. 

(b) Find singular values and right-/left-singular vectors of A. 

(c) Find ||All2 and || All. 

(d) Find the pseudoinverse At of A. 

(e) Find a 4 x 4 matrix B of rank 2 that minimizes ||A — Blo. 
2 1 -1 #1 

4. Suppose A is a 3 x 4 matrix given as A = 4 2 —242 

—2 -1 1 -1 

(a) Find eigenvalues and eigenvectors of A'A. 

(b) Find eigenvalues and eigenvectors of AAT. 


(c) Is there a 3 x 4 matrix B of rank 2 that minimizes ||A — Bl|2? 
5. Let A be an m x n matrix. Define its symmetrization s(A) as 


0 A 
s(A) = , 
A’ 0 
which is an (m+n) x (m+n) symmetric matrix as the name suggests. 
(a) Let (o,v,u) be a singular triplet of A. Then, ø and —o are eigenvalues of s(A). Find the 


eigenvectors of s(A) associated with the eigenvalues g and —ø, respectively. 


(b) Let (A, w) be an eigenpair of s( A) where A 4 0. Then, —A is also an eigenvalue of s( A). Find 


the eigenvector of s(A) associated with the eigenvalue — AÀ. 
(c) Let A < 0 and (A, w) be an eigenpair of s(A). Find right-/left-singular vectors of A associated 


with the singular value — A. 


6. Let A be an m x n matrix with at least one non-zero entry. What are the projection Pg.) (4) onto 
Col (A) and the pseudoinverse A* of A if 
(a) A consists of a single column v, that is, n = 1; 
(b) the columns of A are orthonormal; 


(c) the columns of A are linearly independent (that is rank A = n); 




(d) the columns of A are possibly linearly dependent, and one of its compact SVD is ULV '. 


7. Let $A$ be an $n \times n$ symmetric positive definite matrix. Denote its eigenvalues as $\lambda_1, \dots, \lambda_n$. Let $B$
be the square root of $A$, that is, $A = B^2$.
(a) Let $A = U^\top U$ be the Cholesky factorization of $A$. Find $|\det U|$.
(b) Find the eigenvalues of $B$.
(c) Let $B = QR$ be the QR-decomposition of $B$. Find $|\det R|$.


(d) Does there exist the square root $C$ of $B$? If yes, what are the eigenvalues of $C$?


Bibliography 


Blum, A., Hopcroft, J., and Kannan, R. (2020) Foundations of Data Science, Cambridge University 
Press, UK. 


Garcia, S. R., and Horn, R. A. (2017) A Second Course in Linear Algebra, Cambridge University 
Press, UK. 


Hofmann, T., Schölkopf, B., and Smola, A. J. (2008) Kernel Methods in Machine Learning. The
Annals of Statistics 36(3), 1171-1220.


Langville, A. N., and Meyer, C. D. (2006) Google’s Page Rank and Beyond: The Science of Search 
Engine Rankings, Princeton University Press, USA. 


Strang, G. (2006) Linear Algebra and Its Applications, Fourth Edition, Brooks/Cole, Cengage Learning,
USA.




Appendix A 


Convexity 


Many important and interesting mathematical observations are based on a shared set of special structures 
and properties underlying target mathematical objects. One such example is the convexity of a set or a 


function, which we discuss further here. 


Convexity of a Set 


Consider a subset $C$ of a vector space. We say $C$ is convex if $C$ contains the interpolated vector $\lambda v_1 +
(1 - \lambda)v_2$ for every pair of vectors $v_1$ and $v_2$ in $C$ and every scalar $0 < \lambda < 1$.
Definition A.1 A subset $C$ of a vector space is convex if
$\lambda v_1 + (1 - \lambda)v_2 \in C$ for any $v_1, v_2 \in C$ and $0 < \lambda < 1$.


This definition applies to a subset of a vector space. It is thus natural to check whether the scalar 


multiplication and vector addition preserve the convexity. 


Fact A.1 Let $C_1$ and $C_2$ be convex sets and $a > 0$. Then, both $aC_1$ and $C_1 + C_2$ are convex.
Recall that $aC_1 = \{av : v \in C_1\}$ and $C_1 + C_2 = \{v_1 + v_2 : v_1 \in C_1, v_2 \in C_2\}$.


Proof: Denote two vectors in $aC_1$ as $av_1$ and $av_2$ where $v_1, v_2 \in C_1$. Then, for any $0 < \lambda < 1$,
$\lambda(av_1) + (1 - \lambda)(av_2) = a(\lambda v_1 + (1 - \lambda)v_2) \in aC_1$, since $\lambda v_1 + (1 - \lambda)v_2 \in C_1$ from the convexity of
$C_1$. Let both $v_1 + v_2$ and $w_1 + w_2$ be in $C_1 + C_2$ where $v_1, w_1 \in C_1$ and $v_2, w_2 \in C_2$. For any $0 < \lambda < 1$,
$\lambda(v_1 + v_2) + (1 - \lambda)(w_1 + w_2) = (\lambda v_1 + (1 - \lambda)w_1) + (\lambda v_2 + (1 - \lambda)w_2) \in C_1 + C_2$, since $\lambda v_1 + (1 - \lambda)w_1 \in C_1$
and $\lambda v_2 + (1 - \lambda)w_2 \in C_2$ from the convexity of $C_1$ and $C_2$. ∎


Among the set operations, intersection preserves the convexity. You may easily find an example where 


union does not preserve the convexity. 
Fact A.2 Let $C_1$ and $C_2$ be convex sets. Then, $C_1 \cap C_2$ is also convex.


Proof: For any $v_1, v_2 \in C_1 \cap C_2$ and $0 < \lambda < 1$, $\lambda v_1 + (1 - \lambda)v_2 \in C_1$ and $\lambda v_1 + (1 - \lambda)v_2 \in C_2$ at the
same time from the convexity of $C_1$ and $C_2$. Therefore, the fact holds. ∎


Convexity of a Function 


A real-valued function $f$ defined on a convex set $C$ in a vector space $V$ is called a convex function if the set
$E = \{(v, r) : v \in C, r \geq f(v)\} \subset V \times \mathbb{R}$ is convex. The set $E$ is called the epigraph of $f$ and represents the
space above the graph of $f$. This definition of a convex function in terms of a convex set can be translated
into an equivalent but more practical description as follows:
Definition A.2 A real-valued function $f$ defined on a convex set $C$ is convex if
$f(\lambda v_1 + (1 - \lambda)v_2) \leq \lambda f(v_1) + (1 - \lambda) f(v_2)$ for any $v_1, v_2 \in C$ and $0 < \lambda < 1$.


As the convexity of a function can be characterized in terms of the convexity of its epigraph, Fact A.1 can be translated to apply to functions, that is, addition and positive scalar multiplication of functions preserve convexity.

Fact A.3 Let $f_1$ and $f_2$ be convex functions and $a > 0$. Then, both $af_1$ and $f_1 + f_2$ are convex.


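Proof: Let $\mathbf{v}_1, \mathbf{v}_2 \in C$ and $0 < \lambda < 1$. Then
$$(af_1)(\lambda\mathbf{v}_1 + (1-\lambda)\mathbf{v}_2) \le a\big(\lambda f_1(\mathbf{v}_1) + (1-\lambda) f_1(\mathbf{v}_2)\big) = \lambda(af_1)(\mathbf{v}_1) + (1-\lambda)(af_1)(\mathbf{v}_2).$$
Similarly,
$$(f_1 + f_2)(\lambda\mathbf{v}_1 + (1-\lambda)\mathbf{v}_2) = f_1(\lambda\mathbf{v}_1 + (1-\lambda)\mathbf{v}_2) + f_2(\lambda\mathbf{v}_1 + (1-\lambda)\mathbf{v}_2)$$
$$\le \lambda f_1(\mathbf{v}_1) + (1-\lambda) f_1(\mathbf{v}_2) + \lambda f_2(\mathbf{v}_1) + (1-\lambda) f_2(\mathbf{v}_2) = \lambda(f_1 + f_2)(\mathbf{v}_1) + (1-\lambda)(f_1 + f_2)(\mathbf{v}_2).$$
∎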


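Similarly, we can translate Fact A.2 to show that the maximum of two functions, whose epigraph is the intersection of the epigraphs of the two original functions, is also convex. This can be stated as follows:

Fact A.4 Let $f_1$ and $f_2$ be convex functions. Then, $g(\mathbf{v}) = \max\{f_1(\mathbf{v}), f_2(\mathbf{v})\}$ is convex.

Proof: For two vectors $\mathbf{v}_1, \mathbf{v}_2$ and $0 < \lambda < 1$, since $f_1(\mathbf{v}) \le g(\mathbf{v})$ and $f_2(\mathbf{v}) \le g(\mathbf{v})$,
$$\begin{aligned}
g(\lambda\mathbf{v}_1 + (1-\lambda)\mathbf{v}_2) &= \max\{f_1(\lambda\mathbf{v}_1 + (1-\lambda)\mathbf{v}_2),\ f_2(\lambda\mathbf{v}_1 + (1-\lambda)\mathbf{v}_2)\} \\
&\le \max\{\lambda f_1(\mathbf{v}_1) + (1-\lambda) f_1(\mathbf{v}_2),\ \lambda f_2(\mathbf{v}_1) + (1-\lambda) f_2(\mathbf{v}_2)\} \\
&\le \max\{\lambda g(\mathbf{v}_1) + (1-\lambda) g(\mathbf{v}_2),\ \lambda g(\mathbf{v}_1) + (1-\lambda) g(\mathbf{v}_2)\} \\
&= \lambda g(\mathbf{v}_1) + (1-\lambda) g(\mathbf{v}_2).
\end{aligned}$$
∎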


We often compose simple functions to produce a diverse set of more complicated functions. The composition of a convex function with a linear function results in a convex function.




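Fact A.5 Let $f$ be a convex function on a vector space $V$ and $\ell$ be a linear transformation from a vector space $W$ into $V$. Then, $f \circ \ell$ is a convex function on $W$.

Proof: Let $\mathbf{v}_1, \mathbf{v}_2 \in W$ and $0 < \lambda < 1$. Then
$$(f \circ \ell)(\lambda\mathbf{v}_1 + (1-\lambda)\mathbf{v}_2) = f(\ell(\lambda\mathbf{v}_1 + (1-\lambda)\mathbf{v}_2)) = f(\lambda\ell(\mathbf{v}_1) + (1-\lambda)\ell(\mathbf{v}_2))$$
$$\le \lambda f(\ell(\mathbf{v}_1)) + (1-\lambda) f(\ell(\mathbf{v}_2)) = \lambda(f \circ \ell)(\mathbf{v}_1) + (1-\lambda)(f \circ \ell)(\mathbf{v}_2).$$
∎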


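The square function is a basic building block of non-linear convex functions. We can show its convexity in a few steps of arithmetic.

Example A.1 Consider $f(x) = x^2$ on $\mathbb{R}$. Then, for $x_1, x_2 \in \mathbb{R}$ and $0 < \lambda < 1$,
$$\begin{aligned}
&\lambda f(x_1) + (1-\lambda) f(x_2) - f(\lambda x_1 + (1-\lambda) x_2) \\
&= \lambda x_1^2 + (1-\lambda) x_2^2 - (\lambda x_1 + (1-\lambda) x_2)^2 \\
&= \lambda x_1^2 + (1-\lambda) x_2^2 - \big(\lambda^2 x_1^2 + 2\lambda(1-\lambda) x_1 x_2 + (1-\lambda)^2 x_2^2\big) \\
&= \big(\lambda x_1^2 - \lambda^2 x_1^2 - \lambda(1-\lambda) x_1 x_2\big) + \big((1-\lambda) x_2^2 - \lambda(1-\lambda) x_1 x_2 - (1-\lambda)^2 x_2^2\big) \\
&= \lambda x_1\big(x_1 - \lambda x_1 - (1-\lambda) x_2\big) + (1-\lambda) x_2\big(x_2 - \lambda x_1 - (1-\lambda) x_2\big) \\
&= \lambda(1-\lambda) x_1 (x_1 - x_2) + \lambda(1-\lambda) x_2 (x_2 - x_1) \\
&= \lambda(1-\lambda)(x_1 - x_2)^2 \\
&\ge 0.
\end{aligned}$$
This shows that $f$ is convex. Furthermore, $g(\mathbf{x}) = g(x_1, \ldots, x_n) = x_i^2$ is also a convex function for each fixed index $i$. ∎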
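The closed form obtained in Example A.1 can be checked numerically. The following is a minimal sketch in plain Python; the helper names `convexity_gap` and `closed_form_gap` are illustrative choices, not notation from the text.

```python
def f(x):
    """The square function from Example A.1."""
    return x * x


def convexity_gap(x1, x2, lam):
    """lam*f(x1) + (1-lam)*f(x2) - f(lam*x1 + (1-lam)*x2)."""
    return lam * f(x1) + (1 - lam) * f(x2) - f(lam * x1 + (1 - lam) * x2)


def closed_form_gap(x1, x2, lam):
    """The last expression of the derivation: lam*(1-lam)*(x1-x2)^2."""
    return lam * (1 - lam) * (x1 - x2) ** 2


# Both expressions agree and are non-negative, which is the convexity of f.
gap = convexity_gap(3.0, -1.0, 0.25)
closed = closed_form_gap(3.0, -1.0, 0.25)
print(gap, closed)
```

The gap vanishes only when $x_1 = x_2$ (for $0 < \lambda < 1$), so the square function is in fact strictly convex.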


Convexity of a Quadratic Form 


Assume that $f : \mathbb{R}^m \to \mathbb{R}$ is convex. For an $m \times n$ matrix $A$, if we compose $f$ and a linear transformation $L(\mathbf{x}) = A\mathbf{x}$, $f \circ L(\mathbf{x}) = f(A\mathbf{x})$ is convex by Fact A.5. In the special case of a quadratic function, we relate the convexity to the positive definiteness of $A$, as below.


Proof: Without loss of generality, we may assume that $A$ is symmetric. Then, by item 3 of Fact 7.1, we get
$$A = B^\top B$$
for some $m \times n$ matrix $B$. Consider the function $f(\mathbf{y}) = \mathbf{y}^\top\mathbf{y} = \sum_{i=1}^m y_i^2$. We know that each $y_i^2$ is convex from Example A.1, and the sum of convex functions is also convex by Fact A.3. Hence, $f$ is convex. Then, $f(B\mathbf{x}) = \mathbf{x}^\top B^\top B\mathbf{x} = \mathbf{x}^\top A\mathbf{x}$ is convex by Fact A.5. ∎
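As a numerical illustration of this argument, the sketch below builds $A = B^\top B$ from a random $B$ and checks the convexity inequality for the quadratic form along one segment. It assumes NumPy is available; the sizes, the seed, and the name `worst_gap` are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((3, 4))   # an m x n factor with m = 3, n = 4
A = B.T @ B                       # symmetric positive semi-definite, n x n


def q(x):
    """Quadratic form x^T A x, which equals |Bx|^2."""
    return x @ A @ x


x1 = rng.standard_normal(4)
x2 = rng.standard_normal(4)

# Smallest value of lam*q(x1) + (1-lam)*q(x2) - q(lam*x1 + (1-lam)*x2)
# over a grid of lam in [0, 1]; convexity says it is never negative.
worst_gap = min(
    lam * q(x1) + (1 - lam) * q(x2) - q(lam * x1 + (1 - lam) * x2)
    for lam in np.linspace(0.0, 1.0, 11)
)
print(worst_gap)
```

A grid check along a single segment is of course not a proof; it only shows the inequality behaving as the facts above predict.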




Appendix B 


Permutation and its Matrix Representation


A permutation $\sigma$ is a bijective function from $\{1, \ldots, n\}$ onto itself, that is, $\sigma : \{1, \ldots, n\} \to \{1, \ldots, n\}$ such that $\sigma(i) \ne \sigma(j)$ if $i \ne j$. You may think of a permutation as shuffling the order of $1, \ldots, n$. Therefore, its inverse function $\sigma^{-1}$ always exists, and the inverse function is a permutation, too. This implies $\{\sigma : \sigma \text{ is a permutation on } \{1, \ldots, n\}\} = \{\sigma^{-1} : \sigma \text{ is a permutation on } \{1, \ldots, n\}\}$. We can also define an $n \times n$ matrix associated with a permutation $\sigma$ so that the $(i, \sigma(i))$-element of the matrix is 1 and all other elements in the $i$-th row are zero. Since $(i, \sigma(i))$ equals $(\sigma^{-1}(j), j)$ if we set $j = \sigma(i)$, each column of the matrix also has only one non-zero element. This matrix is called a permutation matrix.

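Let us consider an example of size 4, $\sigma(1) = 3$, $\sigma(2) = 2$, $\sigma(3) = 4$, and $\sigma(4) = 1$. Then, its permutation matrix $Q$ is
$$Q = \begin{bmatrix} 0 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 \\ 1 & 0 & 0 & 0 \end{bmatrix}.$$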


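You may regard a permutation matrix as a row-shuffled identity matrix. The inverse permutation is $\sigma^{-1}(1) = 4$, $\sigma^{-1}(2) = 2$, $\sigma^{-1}(3) = 1$, and $\sigma^{-1}(4) = 3$, whose permutation matrix $Q'$ is
$$Q' = \begin{bmatrix} 0 & 0 & 0 & 1 \\ 0 & 1 & 0 & 0 \\ 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix}.$$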


You can see that $Q' = Q^\top$ from this example. That is, the matrix of the inverse permutation is the transpose of the original permutation matrix. Furthermore, it is easy to see $QQ' = I$ and $Q'Q = I$, which implies that $Q'$ is the inverse $Q^{-1}$ of $Q$. That is, the matrix of the inverse permutation is the inverse of the original permutation matrix. Combining these two relations, every permutation matrix is invertible and its transpose is its inverse matrix, that is, every permutation matrix is orthogonal.

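This conclusion holds not only for this small example. To generalize these results, it is convenient to use the representation of a matrix as a sum of rank-one matrices, if you are familiar with rank-one matrices. For a permutation $\sigma$, its permutation matrix $Q$ can be compactly expressed as
$$Q = \sum_{i=1}^n \mathbf{e}_i \mathbf{e}_{\sigma(i)}^\top$$
and
$$Q^\top = \Big(\sum_{i=1}^n \mathbf{e}_i \mathbf{e}_{\sigma(i)}^\top\Big)^\top = \sum_{i=1}^n \big(\mathbf{e}_i \mathbf{e}_{\sigma(i)}^\top\big)^\top = \sum_{i=1}^n \mathbf{e}_{\sigma(i)} \mathbf{e}_i^\top = \sum_{j=1}^n \mathbf{e}_j \mathbf{e}_{\sigma^{-1}(j)}^\top, \tag{B.2}$$
where the last equality is obtained by substituting $j = \sigma(i)$ and $\sigma^{-1}(j) = \sigma^{-1}(\sigma(i)) = i$. Furthermore,
$$QQ^\top = \Big(\sum_{i=1}^n \mathbf{e}_i \mathbf{e}_{\sigma(i)}^\top\Big)\Big(\sum_{j=1}^n \mathbf{e}_{\sigma(j)} \mathbf{e}_j^\top\Big) = \sum_{i=1}^n \mathbf{e}_i \mathbf{e}_i^\top = I,$$
since $\mathbf{e}_{\sigma(i)}^\top \mathbf{e}_{\sigma(j)}$ equals 1 if $i = j$ and 0 otherwise. That is, for a permutation matrix $Q$,
$$QQ^\top = I, \tag{B.3}$$
which also implies
$$Q^{-1} = Q^\top. \tag{B.4}$$
Therefore, any permutation matrix is also an orthogonal matrix.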
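The size-4 example and the relations (B.3) and (B.4) can be reproduced with a few lines of code. This is a minimal sketch assuming NumPy; the helper `permutation_matrix` and the 1-based list encoding of $\sigma$ are illustrative conventions chosen here.

```python
import numpy as np


def permutation_matrix(sigma):
    """Build Q with a 1 at (i, sigma(i)); sigma is a 1-based list, sigma[i-1] = sigma(i)."""
    n = len(sigma)
    Q = np.zeros((n, n), dtype=int)
    for i, s in enumerate(sigma):
        Q[i, s - 1] = 1
    return Q


sigma = [3, 2, 4, 1]                      # sigma(1)=3, sigma(2)=2, sigma(3)=4, sigma(4)=1

# Invert the permutation: sigma_inv(sigma(i)) = i.
sigma_inv = [0] * len(sigma)
for i, s in enumerate(sigma, start=1):
    sigma_inv[s - 1] = i

Q = permutation_matrix(sigma)
Q_prime = permutation_matrix(sigma_inv)   # matrix of the inverse permutation

print(sigma_inv)
print(np.array_equal(Q_prime, Q.T))       # transpose relation
print(np.array_equal(Q @ Q.T, np.eye(4, dtype=int)))  # orthogonality (B.3)
```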


Appendix C 


The Existence of Optimizers 


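A scalar $\alpha$ is an upper bound of a function $f : X \to \mathbb{R}$ when $f(\mathbf{x}) \le \alpha$ for all $\mathbf{x} \in X$. Let $B = \{\mathbf{x} \in \mathbb{R}^d : |\mathbf{x}| \le 1\}$ be the closed unit ball in the $d$-dimensional Euclidean space.

Consider the following optimization problem given $f : B \to \mathbb{R}$:
$$\max_{\mathbf{x} \in B} f(\mathbf{x}).$$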


If $f$ is continuous and has an upper bound, we can show that there exists at least one $\mathbf{x}^* \in B$ such that $f(\mathbf{x}) \le f(\mathbf{x}^*)$ for all $\mathbf{x} \in B$. That is, there exists an element in $B$ that takes the maximum value of the function $f$.¹ If we already knew statements such as "a continuous image of a compact set is compact" and "a compact set in $\mathbb{R}$ is closed and bounded" from advanced calculus or topology, this would be a trivial result. Here, we instead provide a proof based on basic techniques from calculus. First, we start with the following lemma.

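Lemma C.1 If a real value $\alpha$ is an upper bound of a continuous function $f : B \to \mathbb{R}$, then either
• there exists $\mathbf{x}^* \in B$ such that $\alpha = f(\mathbf{x}^*)$, or
• for any $\hat{\mathbf{x}} \in B$, either
  1. there exists $\tilde{\mathbf{x}} \in B$ such that $f(\hat{\mathbf{x}}) < f(\tilde{\mathbf{x}}) < \alpha$ and $\alpha - f(\tilde{\mathbf{x}}) = \frac{1}{2}(\alpha - f(\hat{\mathbf{x}}))$, or
  2. there exists an upper bound $\beta$ of $f$ that satisfies $\beta - f(\hat{\mathbf{x}}) = \frac{1}{2}(\alpha - f(\hat{\mathbf{x}}))$.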
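Proof:
• Let $\mathbf{x}^*$ be a solution of $f(\mathbf{x}) = \alpha$ if it exists. Otherwise, the condition means $f(\mathbf{x}) < \alpha$ for all $\mathbf{x} \in B$.
• In the latter case, $f(\hat{\mathbf{x}}) < \alpha$ for any $\hat{\mathbf{x}} \in B$. Consider the mid-point between $\alpha$ and $f(\hat{\mathbf{x}})$, $\beta = f(\hat{\mathbf{x}}) + \frac{1}{2}(\alpha - f(\hat{\mathbf{x}}))$, and the corresponding equation $f(\mathbf{x}) = \beta$.
  1. If a solution exists, let it be $\tilde{\mathbf{x}}$. Then, $f(\tilde{\mathbf{x}}) = \beta$, which implies that
  $$\alpha - f(\tilde{\mathbf{x}}) = \alpha - \beta = \alpha - f(\hat{\mathbf{x}}) - \tfrac{1}{2}(\alpha - f(\hat{\mathbf{x}})) = \tfrac{1}{2}(\alpha - f(\hat{\mathbf{x}})).$$
  We also see that $f(\tilde{\mathbf{x}}) > f(\hat{\mathbf{x}})$.
  2. If there is no solution of $f(\mathbf{x}) = \beta$, then $\beta$ is another upper bound of $f$: if $f$ exceeded $\beta$ at some point of $B$, the continuity of $f$ along the segment joining that point to $\hat{\mathbf{x}}$, which stays inside the convex set $B$, would produce a solution of $f(\mathbf{x}) = \beta$. Moreover, $\beta - f(\hat{\mathbf{x}}) = \frac{1}{2}(\alpha - f(\hat{\mathbf{x}}))$ by the definition of $\beta$.
∎

¹ This is not always the case, since for instance there is no element within $(0, 1) = \{x \in \mathbb{R} : 0 < x < 1\}$ that takes the maximum value of $f(x) = x$.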
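Assume that $\alpha$ is an upper bound of a continuous function $f$ over $B$. Starting from $\mathbf{x}_0 = \mathbf{0} \in B$ and $\alpha_0 = \alpha$, we can iteratively find $\mathbf{x}_k \in B$ and upper bounds $\alpha_k$ for $k = 1, 2, \ldots$ that satisfy
$$\alpha_k - f(\mathbf{x}_k) = \frac{1}{2}\big(\alpha_{k-1} - f(\mathbf{x}_{k-1})\big) \tag{C.1}$$
by repeatedly applying Lemma C.1.

If we find $\mathbf{x}^* \in B$ such that $f(\mathbf{x}^*) = \alpha_k$ for some $k$, then $\mathbf{x}^*$ touches an upper bound and is a solution to $f(\mathbf{x}^*) = \max_{\mathbf{x} \in B} f(\mathbf{x})$. We instead assume that, for every $k$, there is no solution to $f(\mathbf{x}) = \alpha_k$ among $\mathbf{x} \in B$. We update along the second bullet of Lemma C.1, applied with $\alpha = \alpha_{k-1}$ and $\hat{\mathbf{x}} = \mathbf{x}_{k-1}$, as follows:
• case 1: keep the upper bound $\alpha_k = \alpha_{k-1}$, and update $\mathbf{x}_k$ as $\tilde{\mathbf{x}}$;
• case 2: keep $\mathbf{x}_k = \mathbf{x}_{k-1}$, and update the upper bound $\alpha_k$ as $\beta$.

If we do not obtain the solution in finite $k$, by (C.1), the upper bound sequence satisfies
$$0 \le \alpha_{k-1} - \alpha_k \le \frac{1}{2^k}\big(\alpha - f(\mathbf{0})\big).$$
Then, $\lim_{k \to \infty} \alpha_k = \alpha + \sum_{k=1}^\infty (\alpha_k - \alpha_{k-1}) = \bar{\alpha}$ exists by the comparison test from a calculus course. For any $\mathbf{x} \in B$, $f(\mathbf{x}) \le \alpha_k$ for all $k$ implies that $f(\mathbf{x}) \le \lim_{k \to \infty} \alpha_k = \bar{\alpha}$, which means that $\bar{\alpha}$ is also an upper bound. Furthermore, $\lim_{k \to \infty} (\alpha_k - f(\mathbf{x}_k)) = 0$ holds from (C.1). Combining the two limits, we get $\lim_{k \to \infty} f(\mathbf{x}_k) = \bar{\alpha}$. That is, we get a sequence $\mathbf{x}_k \in B$ whose function values converge to an upper bound $\bar{\alpha}$. But we do not know yet whether $\mathbf{x}_k$ itself converges.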

To analyze $\{\mathbf{x}_k : k = 1, 2, \ldots\}$, consider the $d$-dimensional cube $C_0 = [-1, 1]^d \subset \mathbb{R}^d$ containing the unit ball $B$. Bisecting each edge of $C_0$ generates $2^d$ smaller cubes with edges of length 1, whose union recovers $C_0$. Since the infinite sequence $\mathbf{x}_k$ is scattered over the union of these $2^d$ cubes, at least one cube contains infinitely many terms of the sequence. Call this cube $C_1$. We re-index the $\mathbf{x}_k$'s in $C_1$ as a subsequence $\mathbf{x}_i^{(1)}$, $i = 1, 2, \ldots$, while preserving the order of the original sequence. To $C_1$, we apply the procedure again to get a cube $C_2$ of edge length $1/2$ containing infinitely many $\mathbf{x}_i^{(1)}$'s. Re-index the $\mathbf{x}_i^{(1)}$'s in $C_2$ as $\mathbf{x}_i^{(2)}$, $i = 1, 2, \ldots$. If we repeat this procedure infinitely many times, we get a sequence of cubes $C_0 \supset C_1 \supset \cdots \supset C_j \supset \cdots$ where each cube $C_j$ contains infinitely many $\mathbf{x}_k$'s under the name of $\mathbf{x}_i^{(j)}$'s such that $\{\mathbf{x}_i^{(j)}\}_{i=1}^\infty \subset C_j$.




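We choose the diagonal terms $\{\mathbf{x}_i^{(i)}\}$, the $i$-th term in the $i$-th subsequence. It is important to notice that $\mathbf{x}_i^{(i)} \in C_j$ for all $i \ge j$ from the monotonicity of the cubes. Take the first coordinate of $\mathbf{x}_j^{(j)} \in \mathbb{R}^d$ and call it $y_j$. Observe that $|y_{j+1} - y_j| \le (\frac{1}{2})^{j-1}$ since both $\mathbf{x}_{j+1}^{(j+1)}$ and $\mathbf{x}_j^{(j)}$ belong to $C_j$, whose edges have length $(\frac{1}{2})^{j-1}$. If we apply the comparison test from a calculus course, the sequence $y_j = y_1 + \sum_{i=1}^{j-1} (y_{i+1} - y_i)$ converges. Applying the same argument to the other coordinates, we can conclude that there exists $\mathbf{x}^* \in \mathbb{R}^d$ such that $|\mathbf{x}_j^{(j)} - \mathbf{x}^*| \to 0$. In addition, from $|\mathbf{x}_j^{(j)}| \le 1$, $|\mathbf{x}^*| \le |\mathbf{x}^* - \mathbf{x}_j^{(j)}| + |\mathbf{x}_j^{(j)}| \le |\mathbf{x}^* - \mathbf{x}_j^{(j)}| + 1$ holds for all $j$. Taking the limit in $j$, we locate the limit vector $\mathbf{x}^*$ within $B$. On the other hand, since $f(\mathbf{x}_j^{(j)})$ is a subsequence of $f(\mathbf{x}_k)$, which converges to the upper bound $\bar{\alpha} = \lim_{k \to \infty} \alpha_k$, we observe that $\bar{\alpha} = \lim_{k \to \infty} f(\mathbf{x}_k) = \lim_{j \to \infty} f(\mathbf{x}_j^{(j)})$. Furthermore, $\lim_{j \to \infty} f(\mathbf{x}_j^{(j)}) = f(\mathbf{x}^*)$ since $f$ is a continuous function, which implies $\bar{\alpha} = f(\mathbf{x}^*)$.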
Lemma C.2 If a continuous function $f : B \to \mathbb{R}$ has an upper bound, there exists an $\mathbf{x}^* \in B$ such that
$$f(\mathbf{x}^*) = \max_{\mathbf{x} \in B} f(\mathbf{x}).$$


If we apply Lemma C.2 to an optimization problem that maximizes $-f$, where $f$ is a continuous function with a lower bound, then we can find a minimizer of $f$ on $B$. In addition, the following is a very useful fact in dealing with optimization formulations: for any sets $B$ and $C$,
$$\text{if } \mathbf{x}^* \in B \subset C, \text{ then } f(\mathbf{x}^*) \le \max_{\mathbf{x} \in B} f(\mathbf{x}) \le \max_{\mathbf{x} \in C} f(\mathbf{x}) \text{ holds.}$$
We use this fact in many places without explicit mention.


Let us apply Lemma C.2 to a maximization of the norms of linearly transformed vectors. 


Lemma C.3 Let $A$ be an $n \times d$ matrix. Then there exists a unit vector $\mathbf{x}^*$ such that
$$|A\mathbf{x}^*| = \max_{|\mathbf{x}| \le 1} |A\mathbf{x}| = \max_{|\mathbf{x}| = 1} |A\mathbf{x}|.$$


Proof: If $A = 0$, the equalities hold trivially. Therefore, let us assume that $A \ne 0$. Then the maximum value is positive.

Consider the first equality. Since the function $|A\mathbf{x}|$ is continuous, we can apply Lemma C.2 once we show that the function $|A\mathbf{x}|$ has an upper bound on $B$. Let $|\mathbf{x}| \le 1$ and denote the $i$-th row of $A$ as $\mathbf{a}_i^\top$. Then, by the Cauchy-Schwarz inequality,
$$|A\mathbf{x}|^2 = \sum_{i=1}^n (\mathbf{a}_i^\top \mathbf{x})^2 \le \sum_{i=1}^n |\mathbf{a}_i|^2 |\mathbf{x}|^2 \le \sum_{i=1}^n |\mathbf{a}_i|^2 = \|A\|_F^2,$$
and we can conclude that the function $|A\mathbf{x}|$ has an upper bound $\|A\|_F$.

For the second equality, let $\mathbf{x}^*$ be the vector satisfying the first equality. If $|\mathbf{x}^*| = 0$, then $|A\mathbf{x}^*| = 0$, which contradicts the positivity of the maximum. If $0 < |\mathbf{x}^*| < 1$, then the unit vector $\mathbf{y} = \frac{1}{|\mathbf{x}^*|}\mathbf{x}^*$ provides a bigger value $|A\mathbf{y}| = \frac{1}{|\mathbf{x}^*|}|A\mathbf{x}^*| > |A\mathbf{x}^*|$, which is a contradiction. Hence $|\mathbf{x}^*| = 1$ and this implies the second equality. ∎
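Lemma C.3 only asserts existence. For a concrete matrix, a maximizer can be computed from the singular value decomposition, using the standard fact (not proved in this appendix) that a top right singular vector attains the maximum and that the maximum equals the largest singular value. The sketch below assumes NumPy; the random matrix, the seed, and the sampling check are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 3))          # an n x d matrix with n = 5, d = 3

# Top right singular vector: a unit vector attaining max |Ax| over |x| = 1.
U, s, Vt = np.linalg.svd(A)
x_star = Vt[0]
best = np.linalg.norm(A @ x_star)

# Compare against many random unit vectors; none should beat x_star.
xs = rng.standard_normal((2000, 3))
xs /= np.linalg.norm(xs, axis=1, keepdims=True)
sampled_max = np.linalg.norm(xs @ A.T, axis=1).max()

# The Frobenius norm is the upper bound used in the proof.
fro = np.linalg.norm(A, "fro")
print(sampled_max <= best <= fro)
```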


We extend Lemma C.3 further to an optimization problem with orthonormal conditions. 




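Lemma C.4 Let $A$ be an $n \times d$ matrix and $\{\mathbf{v}_1, \ldots, \mathbf{v}_k\}$ be orthonormal. Then there exists a feasible $\mathbf{x}^*$ satisfying $|\mathbf{x}^*| = 1$, $\mathbf{v}_1 \perp \mathbf{x}^*, \ldots, \mathbf{v}_k \perp \mathbf{x}^*$ such that
$$|A\mathbf{x}^*| = \max\{|A\mathbf{x}| : |\mathbf{x}| \le 1, \mathbf{v}_1 \perp \mathbf{x}, \ldots, \mathbf{v}_k \perp \mathbf{x}\} = \max\{|A\mathbf{x}| : |\mathbf{x}| = 1, \mathbf{v}_1 \perp \mathbf{x}, \ldots, \mathbf{v}_k \perp \mathbf{x}\}.$$

Proof: Assume $k < d$. Extend the $k$ orthonormal vectors to an orthonormal basis $\{\mathbf{v}_1, \ldots, \mathbf{v}_k, \mathbf{v}_{k+1}, \ldots, \mathbf{v}_d\}$ by the Gram-Schmidt procedure. Denote $V = [\mathbf{v}_1 \,|\, \cdots \,|\, \mathbf{v}_d]$, the matrix whose columns are the $\mathbf{v}_i$'s. If $\mathbf{y}$ is the coordinate vector with respect to the new basis, $\mathbf{x} = V\mathbf{y}$ holds. Note that $\mathbf{v}_i \perp \mathbf{x}$ is equivalent to $y_i = 0$ since $\mathbf{v}_i^\top \mathbf{x} = \mathbf{v}_i^\top V\mathbf{y} = y_i$. Furthermore, $|\mathbf{x}| = \sqrt{\mathbf{x}^\top\mathbf{x}} = \sqrt{\mathbf{y}^\top V^\top V\mathbf{y}} = \sqrt{\mathbf{y}^\top\mathbf{y}} = |\mathbf{y}|$. Set $\tilde{A}$ to be the last $(d - k)$ columns of $AV$ and $\tilde{\mathbf{y}} = (y_{k+1}, \ldots, y_d)^\top$. Then, the following set equalities
$$\{|A\mathbf{x}| : |\mathbf{x}| \le 1, \mathbf{v}_1 \perp \mathbf{x}, \ldots, \mathbf{v}_k \perp \mathbf{x}\} = \{|AV\mathbf{y}| : |\mathbf{y}| \le 1, y_1 = 0, \ldots, y_k = 0\} = \{|\tilde{A}\tilde{\mathbf{y}}| : |\tilde{\mathbf{y}}| \le 1\}$$
imply
$$\max\{|A\mathbf{x}| : |\mathbf{x}| \le 1, \mathbf{v}_1 \perp \mathbf{x}, \ldots, \mathbf{v}_k \perp \mathbf{x}\} = \max\{|\tilde{A}\tilde{\mathbf{y}}| : |\tilde{\mathbf{y}}| \le 1\}.$$
By applying Lemma C.3 to $\max\{|\tilde{A}\tilde{\mathbf{y}}| : |\tilde{\mathbf{y}}| \le 1\}$, we get a unit vector $\tilde{\mathbf{y}}^*$ and $\mathbf{y}^* = \begin{bmatrix} \mathbf{0} \\ \tilde{\mathbf{y}}^* \end{bmatrix}$ satisfying
$$\max\{|\tilde{A}\tilde{\mathbf{y}}| : |\tilde{\mathbf{y}}| \le 1\} = |\tilde{A}\tilde{\mathbf{y}}^*| = |AV\mathbf{y}^*| = |A\mathbf{x}^*|,$$
where $\mathbf{x}^* = V\mathbf{y}^*$ with $|\mathbf{x}^*| = |\mathbf{y}^*| = 1$ and $\mathbf{v}_1 \perp \mathbf{x}^*, \ldots, \mathbf{v}_k \perp \mathbf{x}^*$. This proves the first equality. Since $|\mathbf{x}^*| = 1$, the second equality holds. ∎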


Appendix D 


Covariance Matrices 


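A vector whose entries are random variables is called a random vector. That is, a $d$-dimensional random vector is denoted by $\mathbf{X} = (X_1, \ldots, X_d)^\top$ where each $X_i$ is a random variable. The expectation of a random vector is computed element-wise as
$$\mathbb{E}[\mathbf{X}] = \mathbb{E}\begin{bmatrix} X_1 \\ \vdots \\ X_d \end{bmatrix} = \begin{bmatrix} \mathbb{E}[X_1] \\ \vdots \\ \mathbb{E}[X_d] \end{bmatrix} = (\mathbb{E}[X_1], \ldots, \mathbb{E}[X_d])^\top \in \mathbb{R}^d.$$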


In general, the expectation of a vector or a matrix with random variable entries is defined as a vector or a matrix of the same size whose random variables are replaced by their expectations.

The variance and the mean are basic statistics of a random variable. To measure the variability of a random vector around its mean vector, we extend the definition for random variables to the covariances of all pairs of random variables in the random vector. The covariance between $X_i$ and $X_j$ is recorded at the $(i, j)$-th entry of a matrix called a covariance matrix, that is, $\mathbb{E}[(X_i - \mathbb{E}[X_i])(X_j - \mathbb{E}[X_j])]$ is the $(i, j)$-th entry of the covariance matrix. This covariance matrix is conveniently described as the expectation of a $d \times d$ random matrix,
$$(\mathbf{X} - \mathbb{E}[\mathbf{X}])(\mathbf{X} - \mathbb{E}[\mathbf{X}])^\top = \big((X_i - \mathbb{E}[X_i])(X_j - \mathbb{E}[X_j])\big).$$
This matrix is of rank one and symmetric, since $\big((\mathbf{X} - \mathbb{E}[\mathbf{X}])(\mathbf{X} - \mathbb{E}[\mathbf{X}])^\top\big)^\top = (\mathbf{X} - \mathbb{E}[\mathbf{X}])(\mathbf{X} - \mathbb{E}[\mathbf{X}])^\top$. The symmetry is preserved under the expectation. However, taking the expectation of this rank-one random matrix averages over many different rank-one matrix samples, so the rank of the covariance matrix is generally larger than one. The covariance matrix is denoted as
$$\Sigma = \mathrm{Cov}(\mathbf{X}, \mathbf{X}) = \mathbb{E}[(\mathbf{X} - \mathbb{E}[\mathbf{X}])(\mathbf{X} - \mathbb{E}[\mathbf{X}])^\top].$$
If $\mathbb{E}[\mathbf{X}] = \mathbf{0}$, the covariance matrix is simply described as $\Sigma = \mathbb{E}[\mathbf{X}\mathbf{X}^\top]$. By shifting a random vector by its mean vector, as in $\mathbf{X} - \mathbb{E}[\mathbf{X}]$, the shifted random vector has the zero vector as its mean vector.




D.1 Covariance Matrix of Random Vector 


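Let us investigate the covariance of a random vector $\mathbf{X} = (X_1, \ldots, X_d)^\top$. For simple exposition, set $\mu_i = \mathbb{E}[X_i]$. The squared random variabilities are collected in a rank-one matrix,
$$(\mathbf{X} - \mathbb{E}[\mathbf{X}])(\mathbf{X} - \mathbb{E}[\mathbf{X}])^\top = \begin{bmatrix}
(X_1 - \mu_1)^2 & (X_1 - \mu_1)(X_2 - \mu_2) & \cdots & (X_1 - \mu_1)(X_d - \mu_d) \\
(X_2 - \mu_2)(X_1 - \mu_1) & (X_2 - \mu_2)^2 & \cdots & (X_2 - \mu_2)(X_d - \mu_d) \\
\vdots & \vdots & \ddots & \vdots \\
(X_d - \mu_d)(X_1 - \mu_1) & (X_d - \mu_d)(X_2 - \mu_2) & \cdots & (X_d - \mu_d)^2
\end{bmatrix}.$$
If we take an expectation of this random matrix,
$$\begin{aligned}
\Sigma = \mathrm{Cov}(\mathbf{X}, \mathbf{X}) &= \mathbb{E}[(\mathbf{X} - \mathbb{E}[\mathbf{X}])(\mathbf{X} - \mathbb{E}[\mathbf{X}])^\top] \\
&= \begin{bmatrix}
\mathbb{E}[(X_1 - \mu_1)^2] & \mathbb{E}[(X_1 - \mu_1)(X_2 - \mu_2)] & \cdots & \mathbb{E}[(X_1 - \mu_1)(X_d - \mu_d)] \\
\mathbb{E}[(X_2 - \mu_2)(X_1 - \mu_1)] & \mathbb{E}[(X_2 - \mu_2)^2] & \cdots & \mathbb{E}[(X_2 - \mu_2)(X_d - \mu_d)] \\
\vdots & \vdots & \ddots & \vdots \\
\mathbb{E}[(X_d - \mu_d)(X_1 - \mu_1)] & \mathbb{E}[(X_d - \mu_d)(X_2 - \mu_2)] & \cdots & \mathbb{E}[(X_d - \mu_d)^2]
\end{bmatrix} \\
&= \begin{bmatrix}
\mathrm{Var}(X_1) & \mathrm{Cov}(X_1, X_2) & \cdots & \mathrm{Cov}(X_1, X_d) \\
\mathrm{Cov}(X_2, X_1) & \mathrm{Var}(X_2) & \cdots & \mathrm{Cov}(X_2, X_d) \\
\vdots & \vdots & \ddots & \vdots \\
\mathrm{Cov}(X_d, X_1) & \mathrm{Cov}(X_d, X_2) & \cdots & \mathrm{Var}(X_d)
\end{bmatrix}.
\end{aligned}$$
In particular, if $\mathbb{E}[\mathbf{X}] = \mathbf{0}$, the covariance is described as
$$\Sigma = \mathrm{Cov}(\mathbf{X}, \mathbf{X}) = \mathbb{E}[\mathbf{X}\mathbf{X}^\top] = \mathbb{E}\begin{bmatrix}
X_1^2 & X_1 X_2 & \cdots & X_1 X_d \\
X_2 X_1 & X_2^2 & \cdots & X_2 X_d \\
\vdots & \vdots & \ddots & \vdots \\
X_d X_1 & X_d X_2 & \cdots & X_d^2
\end{bmatrix} = \begin{bmatrix}
\mathbb{E}[X_1^2] & \mathbb{E}[X_1 X_2] & \cdots & \mathbb{E}[X_1 X_d] \\
\mathbb{E}[X_2 X_1] & \mathbb{E}[X_2^2] & \cdots & \mathbb{E}[X_2 X_d] \\
\vdots & \vdots & \ddots & \vdots \\
\mathbb{E}[X_d X_1] & \mathbb{E}[X_d X_2] & \cdots & \mathbb{E}[X_d^2]
\end{bmatrix}.$$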




D.1.1 Positive Definiteness of Covariance Matrices 


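For a $d$-dimensional random vector $\mathbf{X}$ and an arbitrary $\mathbf{y} \in \mathbb{R}^d$, $\mathbf{y}^\top\mathbf{X} = \mathbf{X}^\top\mathbf{y}$ is a random variable. Its square has the alternative expressions
$$\mathbf{y}^\top\mathbf{X}\mathbf{X}^\top\mathbf{y} = (\mathbf{y}^\top\mathbf{X})(\mathbf{X}^\top\mathbf{y}) = (\mathbf{y}^\top\mathbf{X})^2.$$
Assume $\mathbb{E}[\mathbf{X}] = \mathbf{0}$ (otherwise, replace $\mathbf{X}$ by $\mathbf{X} - \mathbb{E}[\mathbf{X}]$). Using the above, the quadratic form induced from $\Sigma$ is
$$\mathbf{y}^\top\Sigma\mathbf{y} = \mathbf{y}^\top\mathbb{E}[\mathbf{X}\mathbf{X}^\top]\mathbf{y} = \mathbb{E}[\mathbf{y}^\top\mathbf{X}\mathbf{X}^\top\mathbf{y}] = \mathbb{E}[(\mathbf{y}^\top\mathbf{X})^2] \ge 0,$$
which shows that $\Sigma$ is positive semi-definite. If $\Sigma$ is not positive definite, there exists some $\mathbf{y} \ne \mathbf{0}$ such that $\mathbb{E}[(\mathbf{y}^\top\mathbf{X})^2] = 0$ and hence $\mathbf{y}^\top\mathbf{X} = 0$ with probability 1. That is, the entries of the random vector $\mathbf{X}$ are linearly dependent and $\mathbf{X}$ cannot admit a $d$-dimensional probability density function.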
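A small example of the degenerate case may help. Below, a three-dimensional zero-mean random vector is built so that its third entry is the sum of the first two, and its exact covariance is formed from the mixing matrix. This is a sketch assuming NumPy; the mixing matrix `L` and the vector `y` are hypothetical choices made for illustration.

```python
import numpy as np

# X = L Z with Z a 2-dimensional vector of independent, zero-mean,
# unit-variance entries, so X_3 = X_1 + X_2 and E[X X^T] = L L^T.
L = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
Sigma = L @ L.T

# y encodes the linear dependence X_1 + X_2 - X_3 = 0.
y = np.array([1.0, 1.0, -1.0])
quad = y @ Sigma @ y              # E[(y^T X)^2]

eigs = np.linalg.eigvalsh(Sigma)  # ascending eigenvalues of the covariance
print(quad, eigs)
```

The quadratic form vanishes along `y`, so this covariance is positive semi-definite but not positive definite, matching the discussion above.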


D.1.2 A Useful Quadratic Identity 


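Consider a block matrix
$$A = \begin{bmatrix} A_{11} & A_{12} \\ A_{12}^\top & A_{22} \end{bmatrix}$$
consisting of matrices $A_{11}$, $A_{12}$, and $A_{22}$ of sizes $n_1 \times n_1$, $n_1 \times n_2$, and $n_2 \times n_2$, respectively. Assume that $A_{11}$ and $A_{22}$ are symmetric and invertible. Then $A$ is symmetric. Vectors $\mathbf{u}_1 \in \mathbb{R}^{n_1}$ and $\mathbf{u}_2 \in \mathbb{R}^{n_2}$ are also given. For $\mathbf{u} = \begin{bmatrix} \mathbf{u}_1 \\ \mathbf{u}_2 \end{bmatrix}$, let us compute the following quadratic difference
$$\mathbf{u}^\top A^{-1}\mathbf{u} - \mathbf{u}_2^\top A_{22}^{-1}\mathbf{u}_2.$$
Assuming the invertibility of the Schur complement $S_{11} = A_{11} - A_{12}A_{22}^{-1}A_{12}^\top$ as in (2.7), we obtain
$$A^{-1} = \begin{bmatrix} S_{11}^{-1} & -S_{11}^{-1}A_{12}A_{22}^{-1} \\ -A_{22}^{-1}A_{12}^\top S_{11}^{-1} & A_{22}^{-1} + A_{22}^{-1}A_{12}^\top S_{11}^{-1}A_{12}A_{22}^{-1} \end{bmatrix}.$$
We can develop the following quadratic form using this inverse representation as
$$\begin{aligned}
\mathbf{u}^\top A^{-1}\mathbf{u} &= \begin{bmatrix} \mathbf{u}_1 \\ \mathbf{u}_2 \end{bmatrix}^\top \begin{bmatrix} A_{11} & A_{12} \\ A_{12}^\top & A_{22} \end{bmatrix}^{-1} \begin{bmatrix} \mathbf{u}_1 \\ \mathbf{u}_2 \end{bmatrix} \\
&= \mathbf{u}_1^\top S_{11}^{-1}\mathbf{u}_1 - 2\mathbf{u}_1^\top S_{11}^{-1}A_{12}A_{22}^{-1}\mathbf{u}_2 + \mathbf{u}_2^\top\big(A_{22}^{-1} + A_{22}^{-1}A_{12}^\top S_{11}^{-1}A_{12}A_{22}^{-1}\big)\mathbf{u}_2 \\
&= \mathbf{u}_1^\top S_{11}^{-1}\mathbf{u}_1 - 2\mathbf{u}_1^\top S_{11}^{-1}A_{12}A_{22}^{-1}\mathbf{u}_2 + \mathbf{u}_2^\top A_{22}^{-1}\mathbf{u}_2 + \mathbf{u}_2^\top A_{22}^{-1}A_{12}^\top S_{11}^{-1}A_{12}A_{22}^{-1}\mathbf{u}_2.
\end{aligned}$$
Then the quadratic difference becomes as simple as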




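$$\mathbf{u}^\top A^{-1}\mathbf{u} - \mathbf{u}_2^\top A_{22}^{-1}\mathbf{u}_2 = \big(\mathbf{u}_1 - A_{12}A_{22}^{-1}\mathbf{u}_2\big)^\top S_{11}^{-1}\big(\mathbf{u}_1 - A_{12}A_{22}^{-1}\mathbf{u}_2\big).$$
We re-arrange it as the quadratic identity
$$\mathbf{u}^\top A^{-1}\mathbf{u} = \mathbf{u}_2^\top A_{22}^{-1}\mathbf{u}_2 + \big(\mathbf{u}_1 - A_{12}A_{22}^{-1}\mathbf{u}_2\big)^\top S_{11}^{-1}\big(\mathbf{u}_1 - A_{12}A_{22}^{-1}\mathbf{u}_2\big), \tag{D.1}$$
which is a key to obtaining the conditional density of a multivariate Gaussian distribution. (See Appendix D.3.)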
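The identity (D.1) is easy to get wrong by a sign or a transpose, so a numerical check is a useful sanity test. The sketch below assumes NumPy and uses a random symmetric positive definite $A$ so that every inverse involved exists; the block sizes and the seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)
n1, n2 = 2, 3

# Random symmetric positive definite A, so A, A22 and S11 are all invertible.
M = rng.standard_normal((n1 + n2, n1 + n2))
A = M @ M.T + np.eye(n1 + n2)
A11, A12, A22 = A[:n1, :n1], A[:n1, n1:], A[n1:, n1:]

u = rng.standard_normal(n1 + n2)
u1, u2 = u[:n1], u[n1:]

# Schur complement S11 = A11 - A12 A22^{-1} A12^T.
S11 = A11 - A12 @ np.linalg.solve(A22, A12.T)

# Residual u1 - A12 A22^{-1} u2 appearing in (D.1).
r = u1 - A12 @ np.linalg.solve(A22, u2)

lhs = u @ np.linalg.solve(A, u)                                   # u^T A^{-1} u
rhs = u2 @ np.linalg.solve(A22, u2) + r @ np.linalg.solve(S11, r)
print(lhs, rhs)
```

Using `np.linalg.solve` instead of forming explicit inverses is a common numerical habit; it does not change the identity being checked.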


D.2 Multivariate Gaussian Distribution 


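Let $\mathbf{X}$ be a random vector, and let $\boldsymbol{\mu} = \mathbb{E}[\mathbf{X}]$ and $\Sigma = \mathbb{E}[(\mathbf{X} - \boldsymbol{\mu})(\mathbf{X} - \boldsymbol{\mu})^\top]$ be its mean vector and covariance matrix, respectively. Assume that a multivariate probability density function $f(\mathbf{x})$ describing the likelihood of the random vector $\mathbf{X}$ is given as a function of¹
$$(\mathbf{x} - \boldsymbol{\mu})^\top\Sigma^{-1}(\mathbf{x} - \boldsymbol{\mu}).$$
For an appropriate function $g(\cdot)$,
$$f(\mathbf{x}) = g\big((\mathbf{x} - \boldsymbol{\mu})^\top\Sigma^{-1}(\mathbf{x} - \boldsymbol{\mu})\big)$$
is a density of a so-called elliptical distribution, a family that includes the multivariate Gaussian distribution and the multivariate t-distribution. If the function $g$ is a decreasing function, the level set $\{\mathbf{x} \in \mathbb{R}^d : f(\mathbf{x}) \ge a\}$ of the density function is characterized by the following ellipsoid
$$\{\mathbf{x} \in \mathbb{R}^d : (\mathbf{x} - \boldsymbol{\mu})^\top\Sigma^{-1}(\mathbf{x} - \boldsymbol{\mu}) \le c\}$$
for some positive $c$ (see Section 7.5). This simple ellipsoidal geometry of level sets provides various ideas in the analysis of high-dimensional data.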


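For a $d \times d$ positive definite matrix $\Sigma$, the following definite integral is well-known:
$$\int_{\mathbb{R}^d} e^{-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^\top\Sigma^{-1}(\mathbf{x} - \boldsymbol{\mu})}\,d\mathbf{x} = \sqrt{(2\pi)^d \det\Sigma}.$$
Normalizing the left-hand side by the constant on the right-hand side, we obtain a multivariate density function $f_{\boldsymbol{\mu},\Sigma}$ on $\mathbb{R}^d$ given by
$$f_{\boldsymbol{\mu},\Sigma}(\mathbf{x}) = \frac{1}{\sqrt{(2\pi)^d \det\Sigma}}\, e^{-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^\top\Sigma^{-1}(\mathbf{x} - \boldsymbol{\mu})},$$
which satisfies
$$f_{\boldsymbol{\mu},\Sigma}(\mathbf{x}) \ge 0 \quad \text{and} \quad \int_{\mathbb{R}^d} f_{\boldsymbol{\mu},\Sigma}(\mathbf{x})\,d\mathbf{x} = 1.$$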


¹ As introduced in Section 7.5, $\sqrt{(\mathbf{x} - \boldsymbol{\mu})^\top\Sigma^{-1}(\mathbf{x} - \boldsymbol{\mu})}$ is a Mahalanobis distance defined by a positive definite matrix $\Sigma$ and a center point $\boldsymbol{\mu}$. The Mahalanobis distance can be interpreted as a statistical distance measured in units of the standard deviation.




This distribution is called the multivariate Gaussian distribution. It can be shown that the mean vector and covariance matrix of $f_{\boldsymbol{\mu},\Sigma}$ are $\boldsymbol{\mu}$ and $\Sigma$, respectively. Since the Gaussian distribution is fully characterized by its mean and covariance, we simply denote it as $N(\boldsymbol{\mu}, \Sigma)$. Many applications require the conditional density function of a multivariate Gaussian, the derivation of which is non-trivial. It is provided in Appendix D.3 by combining results on block matrices and determinants. In addition, Appendix D.4 explains how to generate random samples from the multivariate Gaussian distribution based on the Cholesky decomposition.


D.3 Conditional Multivariate Gaussian Distribution 


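Let $\mathbf{X}_1$ be an $n_1$-dimensional random vector, $\mathbf{X}_2$ an $n_2$-dimensional random vector, and $\boldsymbol{\mu}_1 \in \mathbb{R}^{n_1}$ and $\boldsymbol{\mu}_2 \in \mathbb{R}^{n_2}$ their mean vectors, respectively. The covariances are given as $\Sigma_{11} = \mathbb{E}[(\mathbf{X}_1 - \boldsymbol{\mu}_1)(\mathbf{X}_1 - \boldsymbol{\mu}_1)^\top]$, $\Sigma_{12} = \mathbb{E}[(\mathbf{X}_1 - \boldsymbol{\mu}_1)(\mathbf{X}_2 - \boldsymbol{\mu}_2)^\top]$, and $\Sigma_{22} = \mathbb{E}[(\mathbf{X}_2 - \boldsymbol{\mu}_2)(\mathbf{X}_2 - \boldsymbol{\mu}_2)^\top]$. Define the augmented random vector as $\mathbf{X} = \begin{bmatrix} \mathbf{X}_1 \\ \mathbf{X}_2 \end{bmatrix}$ and its mean vector $\boldsymbol{\mu} = \begin{bmatrix} \boldsymbol{\mu}_1 \\ \boldsymbol{\mu}_2 \end{bmatrix}$. The covariance of $\mathbf{X}$ is
$$\Sigma = \mathbb{E}[(\mathbf{X} - \boldsymbol{\mu})(\mathbf{X} - \boldsymbol{\mu})^\top] = \mathbb{E}\left[\begin{bmatrix} \mathbf{X}_1 - \boldsymbol{\mu}_1 \\ \mathbf{X}_2 - \boldsymbol{\mu}_2 \end{bmatrix}\begin{bmatrix} \mathbf{X}_1 - \boldsymbol{\mu}_1 \\ \mathbf{X}_2 - \boldsymbol{\mu}_2 \end{bmatrix}^\top\right] = \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix},$$
where $\Sigma_{21} = \Sigma_{12}^\top$. If we set $\mathbf{u}_1 = \mathbf{x}_1 - \boldsymbol{\mu}_1$, $\mathbf{u}_2 = \mathbf{x}_2 - \boldsymbol{\mu}_2$, and $A_{ij} = \Sigma_{ij}$ and plug them into (D.1) with the notation
$$\tilde{\Sigma} = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{12}^\top,$$
we obtain a decomposition
$$\begin{aligned}
(\mathbf{x} - \boldsymbol{\mu})^\top\Sigma^{-1}(\mathbf{x} - \boldsymbol{\mu}) ={}& (\mathbf{x}_2 - \boldsymbol{\mu}_2)^\top\Sigma_{22}^{-1}(\mathbf{x}_2 - \boldsymbol{\mu}_2) \\
&+ \big(\mathbf{x}_1 - (\boldsymbol{\mu}_1 + \Sigma_{12}\Sigma_{22}^{-1}(\mathbf{x}_2 - \boldsymbol{\mu}_2))\big)^\top\tilde{\Sigma}^{-1}\big(\mathbf{x}_1 - (\boldsymbol{\mu}_1 + \Sigma_{12}\Sigma_{22}^{-1}(\mathbf{x}_2 - \boldsymbol{\mu}_2))\big).
\end{aligned}$$
To find a conditional density of the multivariate Gaussian, we have to simplify $\frac{f_{\boldsymbol{\mu},\Sigma}(\mathbf{x})}{f_{\boldsymbol{\mu}_2,\Sigma_{22}}(\mathbf{x}_2)}$. The above decomposition of quadratic terms helps us simplify the exponents of the conditional density. (8.4) lets us know $\det\Sigma = \det\Sigma_{22}\det\tilde{\Sigma}$. Therefore, the conditional multivariate Gaussian density is given by
$$\begin{aligned}
\frac{f_{\boldsymbol{\mu},\Sigma}(\mathbf{x})}{f_{\boldsymbol{\mu}_2,\Sigma_{22}}(\mathbf{x}_2)} &= \frac{\sqrt{(2\pi)^{n_2}\det\Sigma_{22}}}{\sqrt{(2\pi)^{n_1+n_2}\det\Sigma}}\, e^{-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^\top\Sigma^{-1}(\mathbf{x} - \boldsymbol{\mu}) + \frac{1}{2}(\mathbf{x}_2 - \boldsymbol{\mu}_2)^\top\Sigma_{22}^{-1}(\mathbf{x}_2 - \boldsymbol{\mu}_2)} \\
&= \frac{1}{\sqrt{(2\pi)^{n_1}\det\tilde{\Sigma}}}\, e^{-\frac{1}{2}\left(\mathbf{x}_1 - (\boldsymbol{\mu}_1 + \Sigma_{12}\Sigma_{22}^{-1}(\mathbf{x}_2 - \boldsymbol{\mu}_2))\right)^\top\tilde{\Sigma}^{-1}\left(\mathbf{x}_1 - (\boldsymbol{\mu}_1 + \Sigma_{12}\Sigma_{22}^{-1}(\mathbf{x}_2 - \boldsymbol{\mu}_2))\right)}.
\end{aligned}$$
As we can see from the density function, this conditional distribution is again a multivariate Gaussian distribution, with mean $\boldsymbol{\mu}_1 + \Sigma_{12}\Sigma_{22}^{-1}(\mathbf{x}_2 - \boldsymbol{\mu}_2)$ and covariance $\tilde{\Sigma} = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{12}^\top$. It is usual to call $\boldsymbol{\mu}_1 + \Sigma_{12}\Sigma_{22}^{-1}(\mathbf{x}_2 - \boldsymbol{\mu}_2)$ a conditional mean and $\Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{12}^\top$ a conditional covariance. The conditional covariance has the form of the Schur complement.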
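The conditional mean and conditional covariance translate directly into code. The following is a minimal sketch assuming NumPy; the function name `conditional_gaussian`, its argument layout (the first `n1` coordinates form $\mathbf{X}_1$), and the small 2-dimensional example are illustrative assumptions, not part of the text.

```python
import numpy as np


def conditional_gaussian(mu, Sigma, n1, x2):
    """Mean and covariance of X1 given X2 = x2 for X = (X1, X2) ~ N(mu, Sigma).

    The first n1 coordinates form X1, the remaining ones X2.
    """
    mu1, mu2 = mu[:n1], mu[n1:]
    S11, S12, S22 = Sigma[:n1, :n1], Sigma[:n1, n1:], Sigma[n1:, n1:]
    # K = Sigma12 Sigma22^{-1}, computed via a linear solve (Sigma22 is symmetric).
    K = np.linalg.solve(S22, S12.T).T
    cond_mean = mu1 + K @ (x2 - mu2)
    cond_cov = S11 - K @ S12.T          # Schur complement of Sigma22
    return cond_mean, cond_cov


mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 1.0],
                  [1.0, 1.0]])
m, C = conditional_gaussian(mu, Sigma, 1, np.array([3.0]))
print(m, C)

# Determinant relation used in the derivation: det(Sigma) = det(Sigma22) * det(C).
print(np.isclose(np.linalg.det(Sigma), np.linalg.det(Sigma[1:, 1:]) * np.linalg.det(C)))
```

Note that the conditional covariance does not depend on the observed value $\mathbf{x}_2$; only the conditional mean does.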




D.4 Multivariate Gaussian Sampling using Cholesky Decomposition


Generating random samples from statistical distributions has a long history. One of the simplest forms of random sampling is to choose an integer uniformly among $\{0, 1, 2, \ldots, p-1\}$ for a prime number $p$. If we divide the generated numbers by $p$, we get approximate random samples distributed uniformly on $[0, 1]$. A bigger prime $p$ results in higher-quality samples on $[0, 1]$. Once we have random samples distributed uniformly on $[0, 1]$, we can generate random samples from a distribution with a cumulative distribution function $F$ by composing the $[0, 1]$-sample with $F^{-1}$. The exponential distribution is an example where this approach works well. However, for many distributions it is difficult to find the inverse function in a form that is easy to evaluate. The dependence structures of multivariate distributions are another source of difficulty for random sampling. It is not easy to get random samples from the multivariate Gaussian either.

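However, because of its popularity, there are many libraries that provide high-quality random samples from $N(0, 1)$ or, equivalently, $N(\mathbf{0}, I)$. Hence, we start with independent random samples from $N(\mathbf{0}, I)$ to obtain random samples from $N(\boldsymbol{\mu}, \Sigma)$. Assume that the covariance $\Sigma$ has a Cholesky decomposition $\Sigma = R^\top R$. For this factor $R$, we define a random vector by
$$\mathbf{X} = R^\top\mathbf{Z},$$
where the random vector $\mathbf{Z}$ follows $N(\mathbf{0}, I)$, the $d$-dimensional standard multivariate Gaussian distribution. Then,
• its mean vector is $\mathbb{E}[\mathbf{X}] = R^\top\mathbb{E}[\mathbf{Z}] = R^\top\mathbf{0} = \mathbf{0}$;
• its covariance is
$$\mathbb{E}[\mathbf{X}\mathbf{X}^\top] = \mathbb{E}[R^\top\mathbf{Z}\mathbf{Z}^\top R] = R^\top\mathbb{E}[\mathbf{Z}\mathbf{Z}^\top]R = R^\top I R = \Sigma$$
since $\mathbb{E}[\mathbf{Z}\mathbf{Z}^\top] = I$.

Hence, the random vector $\mathbf{X}$ shares its mean and covariance with $N(\mathbf{0}, \Sigma)$, but we do not know yet what the distribution of $\mathbf{X}$ is. A key property for this characterization is that a linear combination of independent Gaussian random variables is again Gaussian. Therefore, every linear combination of the entries of $R^\top\mathbf{Z}$ is also Gaussian, which in turn implies that the random sample $\mathbf{X}$ follows $N(\mathbf{0}, \Sigma)$. Adding the mean vector, the shifted random vector $\boldsymbol{\mu} + R^\top\mathbf{Z}$ follows $N(\boldsymbol{\mu}, \Sigma)$.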
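The sampling recipe can be sketched as follows, assuming NumPy. One convention to keep in mind: `np.linalg.cholesky` returns a lower-triangular factor `L` with $\Sigma = LL^\top$, so `L` plays the role of $R^\top$ above. The particular $\boldsymbol{\mu}$, $\Sigma$, sample size, and seed are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(3)

mu = np.array([1.0, -2.0])
Sigma = np.array([[4.0, 1.2],
                  [1.2, 1.0]])

# Lower-triangular Cholesky factor: Sigma = L @ L.T, i.e. L corresponds to R^T.
L = np.linalg.cholesky(Sigma)

# Columns of Z are independent samples from N(0, I).
Z = rng.standard_normal((2, 200_000))

# Each column of X is a sample mu + R^T z.
X = mu[:, None] + L @ Z

sample_mean = X.mean(axis=1)
sample_cov = np.cov(X)      # rows are variables, columns are observations
print(sample_mean)
print(sample_cov)
```

With this many samples, the empirical mean and covariance should land close to `mu` and `Sigma`, up to sampling noise.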


D.5 Ill-conditioned Sample Covariance Matrices


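Inversion of sample covariance matrices is a popular task in statistics and machine learning. From $d$-dimensional samples $\mathbf{x}_1, \ldots, \mathbf{x}_n$ with sample mean $\bar{\mathbf{x}} = \frac{1}{n}\sum_{i=1}^n \mathbf{x}_i$, an unbiased estimator of their covariance matrix is
$$\hat{\Sigma} = \frac{1}{n-1}\sum_{i=1}^n (\mathbf{x}_i - \bar{\mathbf{x}})(\mathbf{x}_i - \bar{\mathbf{x}})^\top,$$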




which is desired to be positive definite. In practice, the inversion often fails for several reasons. One of them is that $\hat{\Sigma}$ is not guaranteed to be positive definite, only positive semi-definite. Another possibility is that numerical errors in the inversion operations destroy the invertibility of positive definite matrices if they have very small positive eigenvalues. In either case, we perturb the estimator by a small positive number² $\epsilon$, and use $\hat{\Sigma} + \epsilon I$ instead. A direct check using the definition of an eigenpair, or Fact 7.7, confirms that every eigenvalue of $\hat{\Sigma}$ increases by $\epsilon$ through this perturbation.

² 0.01 is a popular choice of $\epsilon$ in practice.
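The effect of the perturbation can be seen on a deliberately rank-deficient sample covariance. The sketch below assumes NumPy and draws fewer samples than dimensions, so the estimator is singular by construction; the sizes, the seed, and the value of `eps` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)

# Only n = 3 samples in d = 5 dimensions: the sample covariance has rank <= 2.
samples = rng.standard_normal((3, 5))            # rows are samples
Sigma_hat = np.cov(samples, rowvar=False)        # 5 x 5, normalized by n - 1

eps = 0.01
eig_before = np.linalg.eigvalsh(Sigma_hat)
eig_after = np.linalg.eigvalsh(Sigma_hat + eps * np.eye(5))

print(eig_before)   # smallest eigenvalues are (numerically) zero
print(eig_after)    # every eigenvalue is shifted up by eps
```

After the shift the matrix is positive definite and can be inverted, at the cost of a small bias controlled by `eps`.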




Appendix E 


Complex Numbers and Matrices 


A complex number is defined by two real numbers. For this definition, we need a special complex number $i$ called the imaginary unit, satisfying $i^2 = -1$. For two real numbers $a$ and $b$, $a + ib$ is a complex number. So, $i$ is the complex number corresponding to the pair of 0 and 1. We denote the set of complex numbers as $\mathbb{C}$ and the set of vectors with $n$ complex entries as $\mathbb{C}^n$. In the arithmetic of complex numbers, $i$ may be regarded as a symbol like a real variable with the notational conventions $a + i(-b) = a - ib$ and $a + i(1) = a + i$. The complex conjugate of $a + ib$ is $a - ib$, also denoted by $\overline{a + ib}$. When $b = 0$, $a + ib = a$ is a real number. The basic arithmetic of complex numbers is listed below.


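• $(a_1 + ib_1) \pm (a_2 + ib_2) = (a_1 \pm a_2) + i(b_1 \pm b_2)$;
• $(a_1 + ib_1) \times (a_2 + ib_2) = (a_1a_2 - b_1b_2) + i(a_1b_2 + b_1a_2)$;
• $|a_1 + ib_1|^2 = (a_1 + ib_1) \times \overline{(a_1 + ib_1)} = (a_1 + ib_1) \times (a_1 - ib_1) = a_1^2 + b_1^2 \ge 0$ and $|a_1 + ib_1| = \sqrt{a_1^2 + b_1^2}$;
• $(a + ib)^{-1} = \dfrac{\overline{a + ib}}{|a + ib|^2} = \dfrac{a - ib}{a^2 + b^2}$ if $|a + ib| \ne 0$.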


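Note that the conjugation of a real number does not alter the number. The order of arithmetic and conjugation can be exchanged. For $z_1, z_2 \in \mathbb{C}$,
• $\overline{(a_1 + ib_1) \pm (a_2 + ib_2)} = \overline{(a_1 + ib_1)} \pm \overline{(a_2 + ib_2)}$, that is, $\overline{z_1 \pm z_2} = \bar{z}_1 \pm \bar{z}_2$;
• $\overline{(a_1 + ib_1) \times (a_2 + ib_2)} = \overline{(a_1 + ib_1)} \times \overline{(a_2 + ib_2)}$, that is, $\overline{z_1 \cdot z_2} = \bar{z}_1 \cdot \bar{z}_2$.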


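A complex matrix is a matrix with complex numbers as its entries. For complex matrices, the conjugate transpose plays the role of the transpose of real matrices, and is defined as $A^{\mathsf{H}} = (\overline{a_{ji}})$ for $A = (a_{ij})$. Similarly to the symmetry $A^\top = A$, $A$ is a Hermitian matrix if $A^{\mathsf{H}} = A$. Notice that $A^{\mathsf{H}} = A^\top$ for a real matrix $A$. On the other hand, $z^{\mathsf{H}} = \bar{z}$ if we regard $z \in \mathbb{C}$ as a $1 \times 1$ matrix. For $z \in \mathbb{C}$, $\mathbf{z} \in \mathbb{C}^n$, and complex matrices $A$ and $B$,
• $\overline{\overline{z}} = z$, $(\mathbf{z}^{\mathsf{H}})^{\mathsf{H}} = \mathbf{z}$;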




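• $\overline{A + B} = \overline{A} + \overline{B}$, $\overline{AB} = \overline{A}\,\overline{B}$;
• $(A + B)^{\mathsf{H}} = A^{\mathsf{H}} + B^{\mathsf{H}}$, $(AB)^{\mathsf{H}} = B^{\mathsf{H}}A^{\mathsf{H}}$,

where we assume that the matrices have compatible sizes so that the arithmetic is well-defined.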


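A few useful facts follow.

Fact E.1 A complex number $z$ is a real number if and only if $z = z^{\mathsf{H}}$.

Proof: For $z = a + ib$, $z^{\mathsf{H}} = \bar{z} = a - ib$. Then, $b = 0$ is equivalent to $z = \bar{z}$. ∎

Fact E.2 If $A = A^{\mathsf{H}}$, then for all complex vectors $\mathbf{x}$, $\mathbf{x}^{\mathsf{H}}A\mathbf{x}$ is real.

Proof: Using Fact E.1, $(\mathbf{x}^{\mathsf{H}}A\mathbf{x})^{\mathsf{H}} = \mathbf{x}^{\mathsf{H}}A^{\mathsf{H}}(\mathbf{x}^{\mathsf{H}})^{\mathsf{H}} = \mathbf{x}^{\mathsf{H}}A\mathbf{x}$ implies that $\mathbf{x}^{\mathsf{H}}A\mathbf{x}$ is real. ∎

The standard inner product in the vector space $\mathbb{C}^n$ is defined as $\langle\mathbf{u}, \mathbf{v}\rangle = \mathbf{u}^{\mathsf{H}}\mathbf{v}$ for $\mathbf{u}, \mathbf{v} \in \mathbb{C}^n$. Most properties of the standard inner product in $\mathbb{R}^n$ carry over, except that the symmetry $\langle\mathbf{u}, \mathbf{v}\rangle = \langle\mathbf{v}, \mathbf{u}\rangle$ is replaced by the conjugate symmetry $\langle\mathbf{u}, \mathbf{v}\rangle = \overline{\langle\mathbf{v}, \mathbf{u}\rangle}$, so that the inner product is conjugate-linear in its first argument.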
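Fact E.2 can be observed numerically. The sketch below assumes NumPy and builds a Hermitian matrix as $M + M^{\mathsf{H}}$ from a random complex $M$; the size and the seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(5)

# Random complex matrix M; A = M + M^H is Hermitian by construction.
M = rng.standard_normal((3, 3)) + 1j * rng.standard_normal((3, 3))
A = M + M.conj().T

x = rng.standard_normal(3) + 1j * rng.standard_normal(3)

# Quadratic form x^H A x; its imaginary part vanishes up to rounding.
q = x.conj() @ A @ x
print(q)
```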


Appendix F 


An Alternative Proof of the Spectral Decomposition Theorem


We provide an alternative proof of Theorem 5.2 without relying on SVD. 


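(The Real Spectral Decomposition) Let $A$ be a real symmetric matrix. Then, $A$ is orthogonally diagonalizable. That is,
$$A = V\Lambda V^\top = \sum_{i=1}^n \lambda_i\mathbf{v}_i\mathbf{v}_i^\top,$$
where $V$ is an orthogonal matrix with orthonormal columns $\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_n$, $|\mathbf{v}_i| = 1$, and $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_n)$.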


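Proof: Let $\mathbf{e}_1 = (1, 0, \ldots, 0)^\top$. Because $A$ is symmetric, there is at least one eigenpair $(\lambda, \mathbf{v})$ where $\lambda$ is real and $|\mathbf{v}| = 1$. For such an eigenpair, there exists a real orthogonal matrix $Q$ that satisfies $Q\mathbf{v} = \mathbf{e}_1$ and $Q^{-1} = Q^\top$.¹ Since $A\mathbf{v} = \lambda\mathbf{v}$,
$$QAQ^\top\mathbf{e}_1 = QA\mathbf{v} = \lambda Q\mathbf{v} = \lambda\mathbf{e}_1.$$
The first column of $QAQ^\top$ is $\begin{bmatrix} \lambda \\ \mathbf{0} \end{bmatrix}$, and $QAQ^\top$ is symmetric. We can thus express it as
$$QAQ^\top = \begin{bmatrix} \lambda & \mathbf{0}^\top \\ \mathbf{0} & A_{n-1} \end{bmatrix},$$
where $A_{n-1}$ is an appropriately chosen $(n-1) \times (n-1)$ symmetric matrix.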


¹ We use the Gram-Schmidt process to begin from $\mathbf{v}$ and successively produce orthonormal vectors $\mathbf{v}_2, \ldots, \mathbf{v}_n$. Then, $Q^\top = [\mathbf{v} \,|\, \mathbf{v}_2 \,|\, \cdots \,|\, \mathbf{v}_n]$ satisfies $Q\mathbf{v} = \mathbf{e}_1$ and $QQ^\top = I$.


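Arguing by induction on the size of the matrix, assume there exists an $(n-1) \times (n-1)$ real orthogonal matrix $Q_{n-1}$ such that $Q_{n-1}A_{n-1}Q_{n-1}^\top = \Lambda_{n-1}$ is diagonal for this $(n-1) \times (n-1)$ symmetric matrix $A_{n-1}$. Let $Q_n = \begin{bmatrix} 1 & \mathbf{0}^\top \\ \mathbf{0} & Q_{n-1} \end{bmatrix}$. Then, $Q_n$ is an $n \times n$ real orthogonal matrix, and satisfies
$$Q_n\begin{bmatrix} \lambda & \mathbf{0}^\top \\ \mathbf{0} & A_{n-1} \end{bmatrix}Q_n^\top = \begin{bmatrix} \lambda & \mathbf{0}^\top \\ \mathbf{0} & \Lambda_{n-1} \end{bmatrix}.$$
Thus, $Q_nQAQ^\top Q_n^\top$ is diagonal, as
$$Q_nQAQ^\top Q_n^\top = (Q_nQ)A(Q_nQ)^\top = \begin{bmatrix} \lambda & \mathbf{0}^\top \\ \mathbf{0} & \Lambda_{n-1} \end{bmatrix} = \Lambda.$$
$Q_nQ$ is orthogonal because both $Q_n$ and $Q$ are orthogonal. If we let $V = (Q_nQ)^\top$, we now see that $A = V\Lambda V^\top$. Finally, by (3.6) in Corollary 3.1, we get
$$V\Lambda V^\top = \sum_{i=1}^n \lambda_i\mathbf{v}_i\mathbf{v}_i^\top.$$
∎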
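The statement of the theorem can be checked numerically on a random symmetric matrix. The sketch below assumes NumPy and relies on the library routine `np.linalg.eigh` to produce the eigenpairs, rather than on the inductive construction used in the proof; the size and the seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(6)

# Random real symmetric matrix.
M = rng.standard_normal((4, 4))
A = (M + M.T) / 2

# eigh returns real eigenvalues and orthonormal eigenvectors (columns of V).
lam, V = np.linalg.eigh(A)

# A = V diag(lam) V^T ...
A_from_factors = V @ np.diag(lam) @ V.T

# ... which equals the sum of rank-one terms lam_i * v_i v_i^T.
A_from_rank_one = sum(l * np.outer(v, v) for l, v in zip(lam, V.T))

print(np.allclose(A, A_from_factors), np.allclose(A, A_from_rank_one))
```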




